CN116071296B - Model training method and device
- Publication number: CN116071296B (application number CN202211488759.XA)
- Authority: CN (China)
- Prior art keywords: output, loss function, detection branch, instance, contour
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06T7/0012—Biomedical image inspection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/149—Segmentation; Edge detection involving deformable models, e.g. active contour models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10072—Tomographic images
- G06T2207/10088—Magnetic resonance imaging [MRI]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30096—Tumor; Lesion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30101—Blood vessel; Artery; Vein; Vascular
Abstract
The invention provides a model training method and device. The method comprises: acquiring a plurality of first images; extracting a first object in each first image and acquiring a second object instance annotated in the first image as a second object instance label; acquiring the contour of the first object; obtaining the contour of the second object from the contour of the first object and the second object instance, and using it as a second object contour label; inputting the first object into a model to be trained and obtaining the output of the model to be trained, wherein the model to be trained comprises a contour detection branch and an instance detection branch, and the output comprises a second object contour output and a second object instance output; calculating a first loss function value and a second loss function value; calculating a third loss function value from the first loss function value and the second loss function value; and adjusting the parameters of the model to be trained based on the third loss function value. The model trained by the model training method provided by the invention has high detection efficiency.
Description
Technical Field
The invention relates to the technical field of intelligent detection, in particular to a model training method and device.
Background
At present, deep-learning-based target object detection methods are mostly limited to feeding complete, pre-processed image blocks directly into the detection model. Their detection precision is low, and because the detection area of each pass is limited, a single detection takes a long time, that is, the detection efficiency is low.
Disclosure of Invention
Therefore, the invention aims to solve the technical problems of low detection precision and low detection efficiency when target objects are detected with existing detection models, and accordingly provides a model training method and device.
According to a first aspect, an embodiment of the present invention provides a model training method, including the following steps:
acquiring a plurality of first images;
extracting a first object in each first image, and acquiring a second object instance marked from the first image as a second object instance tag;
acquiring the outline of the first object;
Acquiring the outline of the second object according to the outline of the first object and the second object instance, and taking the outline of the second object as a second object outline tag;
inputting the first object into the model to be trained and obtaining the output of the model to be trained, wherein the model to be trained comprises a contour detection branch and an instance detection branch, and the output comprises a second object contour output and a second object instance output;
Calculating a first loss function value based on the second object instance label and the second object instance output, respectively, and calculating a second loss function value based on the second object contour label and the second object contour output;
Calculating a third loss function value from the first loss function value and the second loss function value;
And adjusting parameters of the model to be trained based on the third loss function value.
Optionally, the inputting the first object into the model to be trained includes:
grouping the second objects, obtained by annotation of the first object, according to their positions in the first object;

counting the number of second objects at each position;

performing, based on the counted numbers, position-balanced augmentation of the first object using at least one of flipping along the cross-section, adding discrete Gaussian noise, and histogram equalization;

and inputting the first object and the augmented first object into the model to be trained.
Optionally, the model to be trained includes an encoding block, a feature extraction block and a decoding block, and the contour detection branch and the instance detection branch each include the feature extraction block and the decoding block;
the coding block comprises M groups of downsampling structures which are sequentially connected, the M groups of downsampling structures are used for respectively obtaining downsampling results of different scales, the decoding block comprises M groups of upsampling structures which are in one-to-one correspondence with the downsampling structures, and the sampling result of each group of downsampling structures is spliced with the characteristics output by the previous stage structure of the corresponding upsampling structure and then used as the input characteristics of the upsampling structure;
The feature extraction block in the contour detection branch is used for extracting deep features based on the output of the coding block, the feature extraction block in the instance detection branch is used for extracting deep features based on the output of the coding block and the up-sampling result of the intermediate layer up-sampling structure of the decoding block in the contour detection branch, and the decoding block further comprises a classification layer used for performing classification detection based on the output of the up-sampling structure of M groups.
Optionally, each group of downsampling structures in the coding block includes a convolution block and a BiA module connected in sequence; the convolution block is used for downsampling, and the BiA module includes two parallel residual branches used for decoupling the features output by the convolution block to obtain two feature maps, which are respectively input into the decoding block in the contour detection branch and the decoding block in the instance detection branch.
Optionally, each residual branch of the BiA module comprises two residual sub-modules connected in sequence; the BiA module further comprises a spatial attention mechanism block, which comprises a maximum pooling layer and an average pooling layer connected in sequence; the input of the spatial attention mechanism block is the output of the convolution block of the same group of downsampling structures, its output is passed through a Sigmoid function to obtain a weight map, and the weight map is multiplied by the output of the last residual sub-module of each of the two residual branches and then added to obtain the two feature maps.
Optionally, the feature extraction block comprises a plurality of downsampling layers and a plurality of upsampling layers connected in sequence, a Swin-Transformer layer is further included before each downsampling layer and each upsampling layer, the last downsampling layer and the first upsampling layer are spliced by a convolution layer, and the output of the last downsampling layer is spliced with the output of the previous downsampling layer via a shortcut connection.
Optionally, the up-sampling result of the intermediate-layer up-sampling structure of the decoding block in the contour detection branch is down-sampled to the output scale of the encoding block, a weight map is then obtained through a Sigmoid function, and the weight map is added with and multiplied by the output of the encoding block to serve as the input of the feature extraction block of the instance detection branch.
Optionally, the first up-sampling result of the decoding block at each scale has its channel number adjusted by convolution and is then added element-wise to the up-sampling result at the larger scale, serving as deep supervision of the model to be trained.
Optionally, the first loss function value is calculated using the following formula:

$$L_{CDE} = \begin{cases} L_{CE}, & K \le 1 \\ \dfrac{1}{K}\sum_{a=1}^{K} L_{CE}\big(p(x_b),\, q(x_{ab})\big), & K > 1 \end{cases}$$

wherein a indexes the a-th connected domain, K is the number of connected domains, p(x_b) is the true value input to the contour detection branch, b is the class of the true value (foreground 1 or background 0), q(x_ab) is the predicted value of the contour detection branch, L_CE is the cross-entropy loss function, and L_CDE is the loss value of the contour detection branch.
Optionally, the second loss function value is calculated using the following formula:

$$L_{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where X is the prediction result matrix of the instance detection branch, Y is the true value input to the instance detection branch, and L_Dice is the loss value of the instance detection branch.
According to a second aspect, an embodiment of the present invention provides a model training apparatus, including a first acquisition module configured to acquire a plurality of first images;
the processing module is used for extracting a first object in each first image, and acquiring a second object instance marked from the first image as a second object instance tag;
the second acquisition module is used for acquiring the outline of the first object;
the label module is used for acquiring the outline of the second object according to the outline of the first object and the second object instance and taking the outline of the second object as a second object outline label;
the detection module is used for inputting the first object into the model to be trained and acquiring the output of the model to be trained, wherein the model to be trained comprises a contour detection branch and an instance detection branch, and the output comprises a second object contour output and a second object instance output;
a first calculation module for calculating a first loss function value based on the second object instance label and the second object instance output, respectively, and a second loss function value based on the second object contour label and the second object contour output;
A second calculation module for calculating a third loss function value from the first loss function value and the second loss function value;
and the adjusting module is used for adjusting the parameters of the model to be trained based on the third loss function value.
According to a third aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory and the processor are communicatively connected to each other, and the memory stores computer instructions, and the processor executes the computer instructions, thereby performing the model training method described above.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions for causing the computer to perform the above-described model training method.
The technical scheme of the invention has the following advantages:
The model training method provided by the embodiment of the invention can train a dual-supervision model. It considers that the detection target object in an image to be detected, i.e. the second object, is closely related to the first object; for example, an aneurysm is an abnormal bulge on an arterial wall. Therefore, before detecting the second object with the dual-supervision model, the first object is extracted from the image to be detected and input into the dual-supervision model, which outputs the detection result. Preprocessing the first object requires no time-consuming operations such as N4 correction, so detection efficiency is high; the first image does not need to be divided into small blocks during detection, so sensitivity is high; and operations such as image reconstruction are avoided. In addition, the dual-supervision model detects on the first object closely related to the second object rather than on the whole image content, so both efficiency and accuracy during detection are high. The two branches of the dual-supervision model respectively extract the contour features and all the features of the second object, detect possible second objects from the contour features, and further judge from all the features whether a possible second object is actually the second object, thereby reducing the false detection rate and improving the detection precision.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart showing a specific example of a model training method in embodiment 1 of the present application;
FIG. 2 is a schematic block diagram of a specific example of preprocessing in embodiment 1 of the present application;
FIG. 3 is a schematic block diagram of a specific example of a model to be trained in embodiment 1 of the present application;
FIG. 4 is a schematic block diagram of a model training apparatus according to embodiment 2 of the present application;
fig. 5 is a schematic structural diagram of a specific example of a computer device in embodiment 3 of the present application.
Detailed Description
The following description of the embodiments of the present invention will be made apparent and fully in view of the accompanying drawings, in which some, but not all embodiments of the invention are shown. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "coupled" are to be construed broadly; for example, a connection may be fixed, detachable, or integral, mechanical or electrical, direct or indirect via an intermediate medium, internal communication between two elements, and wireless or wired. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Example 1
The embodiment provides a model training method, as shown in fig. 1, including the following steps:
Step S101, a plurality of first images are acquired. The first image may be an image obtained by time-of-flight magnetic resonance angiography (Time-Of-Flight Magnetic Resonance Angiography, abbreviated as TOF-MRA), which is mainly used for imaging head vessels. Of course, the first image in this embodiment may also be a CT image, a DSA image, or the like, and the first object may be a tissue structure contained in the image, etc. This embodiment mainly takes TOF-MRA images as an example, and the plurality of first images includes TOF-MRA images of aneurysms. Images with poor imaging quality or serious artifacts should be avoided when acquiring the first images.
Step S102, extracting a first object in the first image for each first image, and acquiring a second object instance marked from the first image as a second object instance label.
The first image also needs to be preprocessed before the first object is extracted from it, the preprocessing including data standardization. The voxel intensities of the acquired first images are normalized to the range 0-1024, and the origin and voxel spacing (i.e., the voxel pitch) of all the first images are unified, which can be set to (0, 0) and (1, 1), respectively. Here, the second object instance may have been annotated on the first image before preprocessing; in this case, the standardization of the origin and voxel spacing of the annotated second object instance needs to follow the standardization of the first image. Standardizing the origin and voxel spacing of the first image and of the second object instance obtained from its annotation removes the differences in voxel range between images caused by different acquisition parameters.
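A minimal sketch of this preprocessing step is shown below, assuming SimpleITK is used for image I/O and that the images are 3D volumes (so origin and spacing are written as three-component tuples); the file path and function name are illustrative, not taken from the patent.

```python
# Hedged sketch of the preprocessing described above: voxel intensities are
# normalized to the range 0-1024 and the origin / voxel spacing metadata of
# every first image is unified. SimpleITK is assumed for I/O.
import SimpleITK as sitk
import numpy as np

def preprocess_first_image(path: str) -> sitk.Image:
    img = sitk.ReadImage(path)
    arr = sitk.GetArrayFromImage(img).astype(np.float32)

    # Normalize voxel intensities to 0-1024.
    arr = (arr - arr.min()) / max(float(arr.max() - arr.min()), 1e-8) * 1024.0

    out = sitk.GetImageFromArray(arr)
    # Unify origin and voxel spacing across all first images; the same values
    # must also be applied to the annotated second object instances.
    out.SetOrigin((0.0, 0.0, 0.0))
    out.SetSpacing((1.0, 1.0, 1.0))
    return out
```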
When the first image is a TOF-MRA image, the first object may be a vascular structure and the second object may be a hemangioma, such as an aneurysm, accordingly.
Further, the TOF-MRA image can be segmented by setting a threshold on the voxel intensity, extracting the complete vascular structure in the TOF-MRA image and taking the vascular structure as the first object. The vascular structure is dilated according to the size of the aneurysm or the diameter of the vascular structure: for smaller aneurysms the dilation radius can be larger, and for larger aneurysms the dilation radius can be smaller, so that the extraction result contains the vascular structure and part of the surrounding tissue.
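The threshold-based vessel extraction and size-dependent dilation can be sketched as follows; the intensity threshold, the size cut-off, and the dilation radii are illustrative assumptions rather than values from the patent.

```python
# Hedged sketch of first object (vascular structure) extraction: intensity
# thresholding of the TOF-MRA volume, followed by a morphological dilation whose
# radius is chosen inversely to the aneurysm size. All numeric values are
# illustrative assumptions.
import numpy as np
from scipy import ndimage

def extract_first_object(volume: np.ndarray, threshold: float = 300.0,
                         aneurysm_size_mm: float = 5.0) -> np.ndarray:
    vessel_mask = volume > threshold                     # binary vascular structure

    # Smaller aneurysm -> larger dilation radius, larger aneurysm -> smaller radius.
    iterations = 5 if aneurysm_size_mm < 5.0 else 2
    structure = ndimage.generate_binary_structure(3, 1)
    dilated = ndimage.binary_dilation(vessel_mask, structure=structure,
                                      iterations=iterations)

    # The extraction result keeps the vessel plus part of the surrounding tissue.
    return volume * dilated
```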
Before the second object instance is acquired, a second object existing in the first image is also noted. In this embodiment, taking the second object as an aneurysm as an example, that is, labeling the aneurysm in the first image, thereby obtaining a second object instance. In this embodiment, the obtained second object instance is used as the second instance tag.
Step S103, acquiring a contour of the first object. As described above, the first object may be a vascular structure. The first image is preprocessed and, as shown in fig. 2, the extracted vascular structure is processed with the Canny operator so as to obtain the vessel contour. In this embodiment, the contour of the first object may be the contour of a blood vessel.
Step S104, according to the outline of the first object and the second object instance, obtaining the outline of the second object as a second object outline tag.
The annotated second object instance is dilated according to the size of the aneurysm: the smaller the aneurysm, the larger the dilation radius; the larger the aneurysm, the smaller the dilation radius. The dilated second object instance is then multiplied by the contour of the first object to obtain the contour of the second object, as shown in fig. 2.
As described above, the contour of the first object may be the vessel contour, and the second object instance may be the annotation of the aneurysm in the first image. In this embodiment, the contour of the second object is obtained from the contour of the first object and the second object instance; that is, the contour of the aneurysm may be obtained from the vessel contour and the aneurysm annotation. In this embodiment, the second object is located on the first object. Further, the contour of the second object is taken as the second object contour label.
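A sketch of building the second object contour label from the vessel contour and the dilated instance annotation is given below; a slice-wise Canny operator stands in for contour extraction of the 3D vessel volume, and the dilation amount is an illustrative assumption.

```python
# Hedged sketch of the second object (aneurysm) contour label: the Canny operator
# is applied slice by slice to the extracted vascular structure to get the vessel
# contour, the annotated instance is dilated, and the two are multiplied.
import numpy as np
from scipy import ndimage
from skimage import feature

def second_object_contour_label(vessel_volume: np.ndarray,
                                instance_label: np.ndarray,
                                dilation_iters: int = 3) -> np.ndarray:
    # Slice-wise Canny as a stand-in for contour extraction of the 3D vessel.
    vessel_contour = np.stack(
        [feature.canny(slc.astype(float)) for slc in vessel_volume], axis=0)

    # Dilate the annotated second object instance (radius chosen by aneurysm size).
    dilated_instance = ndimage.binary_dilation(instance_label > 0,
                                               iterations=dilation_iters)

    # Multiplying (logical AND) restricts the vessel contour to the instance area.
    return (vessel_contour & dilated_instance).astype(np.uint8)
```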
Step S105, inputting the first object to a model to be trained, and obtaining output of the model to be trained, wherein the model to be trained comprises a contour detection branch and an instance detection branch, the output comprises a second object contour output and a second object instance output, and the instance detection branch detects a second object contour based on contour features extracted by the first object and the contour detection branch.
In this embodiment, the model to be trained may employ a deep convolutional encoder-decoder network structure (CEDNet). And capturing contour and texture feature information of the second object by adopting a CEDNet structure, and executing a detection task of the second object.
In this embodiment, the model to be trained includes two supervision branches, namely a contour detection branch and an instance detection branch. The second object contour label is used as the label of the contour detection branch, and the second object instance label is used as the label of the instance detection branch.
In addition to the first object, the training samples input to the model to be trained may further include normal vascular structures acquired from other images, so as to avoid a high false detection rate after the model is trained.
The contour detection branch extracts the contour features of the second object (in this embodiment, the contour features of the vessel in the aneurysm area) and produces the second object contour output, and the instance detection branch further judges the extracted contour features of the second object and produces the second object instance output. The second object instance output of the instance detection branch is based on the first object and the contour features of the second object extracted by the contour detection branch, so the second object is further detected and the detection result is output as the second object instance output.
Step S106, calculating a first loss function value based on the second object instance label and the second object instance output, and calculating a second loss function value based on the second object contour label and the second object contour output.
In this embodiment, the first loss function L_CDE is an improvement of the cross-entropy loss function, used to apply the cross-entropy loss at the connected-domain level of the prediction result of the model to be trained.
When the first loss function is used, in the first step the voxels whose probability in the probability map predicted by the model to be trained is higher than a threshold are set to 1 to obtain a prediction result; in the second step, the number of connected domains in the prediction result of the first step is counted; in the third step, when the number of connected domains is 0 or 1, the first loss function is directly equal to the cross-entropy loss function, and when the number of connected domains is greater than 1, the prediction result and the true result are added element-wise, each connected domain of the summed result is taken out and its cross entropy with the true value is computed; finally, the series of connected-domain cross-entropy results obtained in the third step is averaged to obtain the first loss function value.
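A PyTorch sketch of this connected-domain cross entropy follows; it is a minimal reading of the steps above (binarize, count connected domains of the prediction-plus-truth map, average per-domain cross entropy), and the threshold, the minimum value, and the use of scipy for labeling are assumptions.

```python
# Hedged sketch of the connected-domain cross-entropy (first loss function).
# prob and target are float tensors of the same shape with values in [0, 1].
import numpy as np
import torch
import torch.nn.functional as F
from scipy import ndimage

def connected_domain_ce(prob: torch.Tensor, target: torch.Tensor,
                        th: float = 0.5, delta: float = 1e-6) -> torch.Tensor:
    pred_bin = (prob.detach().cpu().numpy() > th).astype(np.uint8)
    truth = target.detach().cpu().numpy().astype(np.uint8)
    labeled, num_domains = ndimage.label((pred_bin + truth) > 0)  # connected domains of X+Y

    if num_domains <= 1:
        # 0 or 1 connected domain: plain cross entropy over the whole map.
        return F.binary_cross_entropy(prob, target)

    losses = []
    for a in range(1, num_domains + 1):
        mask = torch.from_numpy(labeled == a).to(prob.device)
        p = prob[mask].clamp(min=delta, max=1.0 - delta)
        losses.append(F.binary_cross_entropy(p, target[mask]))
    return torch.stack(losses).mean()                             # average over domains
```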
Step S107, calculating a third loss function value from the first loss function value and the second loss function value.
In this embodiment, the third loss function may be a weighted sum of the first loss function L_CDE and the second loss function L_Dice, that is:

$$L = \alpha L_{CDE} + \beta L_{Dice}$$

where α is the weight of the first loss function L_CDE, β is the weight of the second loss function L_Dice, and L is the third loss function.
And step S108, adjusting parameters of the model to be trained based on the third loss function value.
If the third loss function value meets the preset threshold requirement, the training is finished; otherwise, the parameters are adjusted and the process returns to step S107 to continue training.
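Assuming loss functions for the two branches (for instance, the connected-domain cross entropy and the Dice loss described in this embodiment), one training iteration might look like the following sketch; the model, the optimizer, and the weights α and β are placeholders, and the pairing of losses to branches follows the formulas given here.

```python
# Hedged sketch of one training step: the two branch outputs are compared with
# their labels, the branch losses are combined as L = alpha*L_CDE + beta*L_Dice
# (the third loss function), and the parameters of the model are adjusted.
import torch

def train_step(model, optimizer, first_object, contour_label, instance_label,
               cde_loss_fn, dice_loss_fn, alpha: float = 0.5, beta: float = 0.5):
    model.train()
    contour_out, instance_out = model(first_object)        # two supervision branches

    loss_cde = cde_loss_fn(contour_out, contour_label)     # contour detection branch
    loss_dice = dice_loss_fn(instance_out, instance_label)  # instance detection branch
    loss = alpha * loss_cde + beta * loss_dice               # third loss function value

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```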
In this embodiment, the first objects are extracted from the acquired plurality of first images, the second object instances annotated in the first images are acquired, and the contours of the second objects are obtained from the first object contours and the second object instances. The contour of the second object is taken as the second object contour label, and the second object instance is taken as the second object instance label. Further, the first object is input into the model to be trained, and the contour detection branch and the instance detection branch of the model output the second object contour output and the second object instance output. A first loss function value is calculated from the second object instance label and the second object instance output, a second loss function value is calculated from the second object contour label and the second object contour output, and a third loss function value is calculated from the first and second loss function values, whereby the parameters of the model to be trained are adjusted based on the third loss function value.
The dual-supervision model obtained with the model training method provided in this embodiment considers that the detection target object in the image to be detected, i.e. the second object, is closely related to the first object; for example, an aneurysm is an abnormal bulge on an arterial wall. Therefore, before the dual-supervision model is used to detect the second object in an image to be detected, the first object is extracted from the image to be detected and input into the dual-supervision model, which outputs the detection result. Preprocessing the first object requires no time-consuming operations such as N4 correction, so detection efficiency is high; the first image does not need to be divided into small blocks during detection, so sensitivity is high; and operations such as image reconstruction are avoided. In addition, the dual-supervision model detects on the first object closely related to the second object rather than on the whole image content, so both efficiency and accuracy during detection are high. The two branches of the dual-supervision model respectively extract the contour features and all the features of the second object, detect possible second objects from the contour features, and further judge from all the features whether a possible second object is actually the second object, thereby reducing the false detection rate and improving the detection precision.
As an optional implementation manner, in an embodiment of the present invention, the inputting the first object into the model to be trained includes:
grouping the second objects, obtained by annotation of the first object, according to their positions in the first object;

counting the number of second objects at each position;

performing, based on the counted numbers, position-balanced augmentation of the first object using at least one of flipping along the cross-section, adding discrete Gaussian noise, and histogram equalization;

and inputting the first object and the augmented first object into the model to be trained.

In this embodiment, the first image is a TOF-MRA image. The first object may be a vascular structure, and the position of the aneurysm in the vascular structure is annotated to obtain the second object, in this embodiment the aneurysm. According to the position of the aneurysm in each vascular structure, the aneurysm positions are divided into regions, and the number of second objects in each region is counted. The first objects are augmented according to the count of each region so as to equalize the number of first objects across regions. Finally, the first object and the augmented first object are input into the model to be trained together; 80% of the input images can be selected as the training set of the model to be trained, and the remaining 20% can serve as the validation set of the model to be trained.
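A sketch of the position-balanced augmentation follows; the noise standard deviation, the rescaling after histogram equalization, and the simple balancing rule are illustrative assumptions.

```python
# Hedged sketch of the position-balanced augmentation: first objects in
# under-represented regions are augmented by flipping along the cross-section,
# adding Gaussian noise (an approximation of discrete Gaussian noise), or
# histogram equalization. All numeric values and the rule are illustrative.
import numpy as np
from skimage import exposure

def augment_once(volume: np.ndarray, mode: str) -> np.ndarray:
    if mode == "flip":
        return np.flip(volume, axis=0).copy()                 # flip along the cross-section
    if mode == "noise":
        return volume + np.random.normal(0.0, 5.0, volume.shape)
    if mode == "hist_eq":
        return exposure.equalize_hist(volume) * 1024.0        # back to the 0-1024 range
    return volume

def balance_by_region(volumes, regions, target_count: int):
    """Augment first objects whose region count is below target_count (illustrative)."""
    counts = {r: regions.count(r) for r in set(regions)}
    balanced = []
    for vol, region in zip(volumes, regions):
        balanced.append(vol)
        if counts[region] < target_count:
            balanced.append(augment_once(vol, "flip"))
    return balanced
```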
As an optional implementation manner, in the embodiment of the present invention, the model to be trained includes an encoding Block (encoding Block), a feature extraction Block (SC Block), and a decoding Block (decoding Block), and the contour detection branch and the instance detection branch each include the feature extraction Block and the decoding Block;
The coding block comprises M groups of downsampling structures which are sequentially connected, the M groups of downsampling structures are used for respectively obtaining downsampling results of different scales, the decoding block comprises M groups of upsampling structures which are in one-to-one correspondence with the downsampling structures, and the sampling results of each group of downsampling structures are spliced with the corresponding characteristics of the previous-stage structure output of the upsampling structures and then used as input characteristics of the upsampling structures.
In this embodiment, taking fig. 3 as an example, the encoding block includes three groups of downsampling structures with different scales connected in sequence, each group of downsampling structures includes two convolution blocks, the decoding block includes three groups of upsampling structures corresponding to the downsampling structures one to one, and each group of upsampling structures includes two convolution blocks. The decoding block has the same structure as the convolution block in the encoding block, but replaces the convolution structure in the encoding block with a deconvolution structure. Each set of downsampling structures in the encoding block further includes BiA modules, each set of downsampling structures outputting two feature maps via BiA modules. One of the feature maps serves as splicing data in the contour detection branch, and the other feature map serves as splicing data of the example detection branch.
The contour detection branch and the instance detection branch both comprise a feature extraction block and a decoding block, the feature extraction block in each branch is sequentially connected with the decoding block, the feature extraction block is used as the previous stage of the decoding block, and the output result of the feature extraction block is used as the input data of the decoding block.
And splicing the sampling result of each group of downsampling structures with the characteristics output by the structure of the previous stage of the corresponding upsampling structure, wherein the previous stage of the upsampling structure corresponding to the first downsampling structure is a characteristic extraction block.
Taking the decoding block in the contour detection branch in fig. 3 as an example, the sampling results of the first group of downsampling structures are d_11 and d_21; d_11 is spliced with the output of the feature extraction block in the contour detection branch to serve as the input feature of the first group of upsampling structures. The sampling results of the second group of downsampling structures are d_12 and d_22; d_12 is spliced with the features output by the previous-stage structure of the upsampling structure corresponding to the second group of downsampling structures, i.e., with the features output by the first group of upsampling structures, and so on.
Taking fig. 3 as an example, j in d_1j and d_2j in fig. 3 may be 1, 2, or 3, and i in d_i1, d_i2, and d_i3 may be 1 or 2.
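The concatenation of each downsampling result d_ij with the features produced by the previous stage of the corresponding upsampling structure can be sketched as a standard skip connection; the channel counts are illustrative, and the BiA and SC blocks are reduced to plain convolutions in this sketch.

```python
# Hedged sketch of the encoder-decoder skip connections: each downsampling result
# is concatenated with the feature map of the previous (deeper) upsampling stage.
# BiA and SC blocks are reduced to plain 3D convolutions; channels are illustrative.
import torch
import torch.nn as nn

class TinyCEDNet(nn.Module):
    def __init__(self, ch=(8, 16, 32)):
        super().__init__()
        self.enc1 = nn.Conv3d(1, ch[0], 3, stride=2, padding=1)
        self.enc2 = nn.Conv3d(ch[0], ch[1], 3, stride=2, padding=1)
        self.enc3 = nn.Conv3d(ch[1], ch[2], 3, stride=2, padding=1)
        self.sc = nn.Conv3d(ch[2], ch[2], 3, padding=1)            # stand-in for the SC block
        self.up3 = nn.ConvTranspose3d(ch[2] * 2, ch[1], 2, stride=2)
        self.up2 = nn.ConvTranspose3d(ch[1] * 2, ch[0], 2, stride=2)
        self.up1 = nn.ConvTranspose3d(ch[0] * 2, 1, 2, stride=2)

    def forward(self, x):
        d1 = torch.relu(self.enc1(x))                              # downsampling results
        d2 = torch.relu(self.enc2(d1))
        d3 = torch.relu(self.enc3(d2))
        deep = torch.relu(self.sc(d3))                             # deep features
        u3 = torch.relu(self.up3(torch.cat([deep, d3], dim=1)))    # splice with d_3
        u2 = torch.relu(self.up2(torch.cat([u3, d2], dim=1)))      # splice with d_2
        u1 = self.up1(torch.cat([u2, d1], dim=1))                  # splice with d_1
        return torch.sigmoid(u1)                                   # classification-layer stand-in
```

An input of shape (N, 1, D, H, W) with D, H, and W divisible by 8 would pass through this sketch; a full dual-branch model would duplicate the decoder path for the contour and instance branches.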
The feature extraction block in the contour detection branch is used for extracting deep features based on the output of the coding block, the feature extraction block in the instance detection branch is used for extracting deep features based on the output of the coding block and the up-sampling result of the intermediate layer up-sampling structure of the decoding block in the contour detection branch, and the decoding block further comprises a classification layer used for performing classification detection based on the output of the up-sampling structure of M groups. Specifically, the classification layer performs classification detection based on the final output of the M groups of up-sampling structures connected in sequence.
In the embodiment, the feature extraction block in the outline detection branch or the feature extraction block in the instance detection branch has the following effects of strengthening the feature extraction capability of the deep network and extracting the abstract features and the image global dependency relationship of the second object of the deep network.
As shown in the decoding block in fig. 3, the decoding block in the contour detection branch further performs downsampling of the intermediate upsampled result, and the downsampled result is further used as an input to the feature extraction block in the example detection branch together with the output of the encoding block, so that the example detection branch further detects the second object in combination with the contour features detected by the contour detection branch.
The decoding blocks in the contour detection branch and the instance detection branch each further comprise a classification layer, and the classification layer adopts SoftMax to perform classification detection on the output of the up-sampling structures in the decoding block.
As an optional implementation manner, in an embodiment of the present invention, each group of the downsampling structures in the coding block includes a convolution block and a BiA module connected in sequence, where the convolution block is used for downsampling, and the BiA module includes two parallel residual branches used for decoupling the features output by the convolution block to obtain two feature maps, which are respectively input to the decoding block in the contour detection branch and the decoding block in the instance detection branch.
The convolution block comprises a 3D convolution layer, a batch normalization layer and a ReLU activation layer, i.e. a Conv+BN+ReLU fused layer. As shown in fig. 3, each group of downsampling structures includes two convolution blocks; the output of the first convolution block serves as the input of the next convolution block and is spliced with the output of the next convolution block, and the output of the last convolution block serves as the input of the BiA module.
The BiA module includes two residual branches in parallel for decoupling the features of the two dual supervisory branches, the contour detection branch and the instance detection branch.
And outputting two feature graphs by utilizing two residual branches connected in parallel with BiA modules, inputting one feature graph in the two feature graphs into a decoding block in the contour detection branch, splicing the feature graph with the feature output by the previous stage structure of the corresponding up-sampling structure in the decoding block in the contour detection branch, inputting the other feature graph into a decoding block in the example detection branch, and splicing the feature output by the previous stage structure of the corresponding up-sampling structure in the decoding block in the example detection branch.
As an optional implementation manner, in an embodiment of the present invention, each residual branch of the BiA modules includes two residual sub-modules that are sequentially connected;
The BiA module further comprises a spatial attention mechanism block, which comprises a maximum pooling layer MaxPool and an average pooling layer AvgPool connected in sequence. The input of the spatial attention mechanism block is the output of the convolution block of the same group of downsampling structures, and its output is passed through a Sigmoid function to obtain a weight map; the weight map is multiplied by the output of the last residual sub-module of each of the two residual branches and then added, yielding the two feature maps. The Sigmoid function is one option here; other classification functions such as SoftMax can also be used.
In this embodiment, the residual sub-module adopts a ResNet Block, and the two residual sub-modules are connected in sequence. As shown in fig. 3, each residual branch includes two residual sub-modules; the output of the first residual sub-module serves as the input of the next residual sub-module, and the output of the last residual sub-module is multiplied by the weight map output by the spatial attention mechanism block and then added to obtain the feature map. Here, the output of the spatial attention mechanism block refers to its output after the max pooling layer MaxPool, the average pooling layer AvgPool, and the Sigmoid function. In this embodiment, the use of the spatial attention mechanism block can enhance the learning of more discriminative features.
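A sketch of the BiA module under these descriptions follows; the size-preserving pooling configuration, the channel-collapsing 1x1 convolution, and the "multiply by the weight map then add back" combination for each branch output are assumptions made to obtain a runnable example.

```python
# Hedged sketch of the BiA module: two parallel residual branches (two residual
# sub-modules each) decouple the convolution-block features, and a spatial
# attention block (MaxPool, then AvgPool, then Sigmoid weight map) modulates both
# branch outputs. Kernel sizes and the exact combination rule are assumptions.
import torch
import torch.nn as nn

class ResidualSubModule(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(ch, ch, 3, padding=1), nn.BatchNorm3d(ch), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1), nn.BatchNorm3d(ch))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class BiAModule(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.branch_contour = nn.Sequential(ResidualSubModule(ch), ResidualSubModule(ch))
        self.branch_instance = nn.Sequential(ResidualSubModule(ch), ResidualSubModule(ch))
        self.spatial_attention = nn.Sequential(
            nn.MaxPool3d(3, stride=1, padding=1),    # size-preserving max pooling
            nn.AvgPool3d(3, stride=1, padding=1),    # size-preserving average pooling
            nn.Conv3d(ch, 1, 1),                     # collapse channels to one weight map
            nn.Sigmoid())

    def forward(self, conv_block_out):
        w = self.spatial_attention(conv_block_out)                  # weight map
        c = self.branch_contour(conv_block_out)
        i = self.branch_instance(conv_block_out)
        # Multiply each branch output by the weight map and add it back.
        return c + c * w, i + i * w                                 # two feature maps
```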
As an optional implementation manner, in an embodiment of the present invention, the feature extraction block includes a plurality of downsampling layers and a plurality of upsampling layers that are sequentially connected;
And a Swin-Transformer layer is further included before each downsampling layer and each upsampling layer; the last downsampling layer and the first upsampling layer are spliced by a convolution layer, and the output of the last downsampling layer is spliced with the output of the previous downsampling layer via a shortcut connection.

In this embodiment, as shown in the feature extraction block SC Block in fig. 3, both the downsampling layers and the upsampling layers adopt Conv+BN+ReLU fused layers, and a Swin-Transformer layer is further included before each downsampling layer and upsampling layer. The output of the last downsampling layer is connected with the output of the previous downsampling layer through a shortcut connection, and the last downsampling layer is spliced with the first upsampling layer using a 1x1 convolution layer.
As an optional implementation manner, in the embodiment of the present invention, after the up-sampling result of the up-sampling structure of the middle layer of the decoding block in the contour detection branch is downsampled to the output scale of the encoding block, a weight map is obtained through a Sigmoid function, and the weight map is added to the output of the encoding block and multiplied to be used as the input of the feature extraction block of the example detection branch.
As described above, the intermediate up-sampling result of the decoding block in the contour detection branch is further down-sampled, and the down-sampled result, together with the output of the encoding block, serves as the input of the feature extraction block in the instance detection branch to extract the deep features of that branch. Specifically, after the intermediate-layer result of the decoding block in the contour detection branch is down-sampled to the output scale of the encoding block, a weight map is obtained through a Sigmoid function; the weight map is added with and multiplied by the output of the encoding block, and the calculated result is taken as the input of the feature extraction block of the instance detection branch. The feature extraction block in the instance detection branch extracts deep features from this input. The extracted contour features are thus mapped as weights, thereby influencing the extraction of the instance features.
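A sketch of this contour-to-instance gating follows; interpolation is used for the down-sampling, a 1x1 convolution matches the channel count, and the (1 + weight) combination is one possible reading of "added with and multiplied by the output of the encoding block".

```python
# Hedged sketch: the intermediate up-sampling result of the contour branch's
# decoding block is brought to the encoding block's output scale, turned into a
# Sigmoid weight map, and used to gate the encoder output before it enters the
# instance branch's feature extraction block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContourToInstanceGate(nn.Module):
    def __init__(self, contour_channels: int, encoder_channels: int):
        super().__init__()
        self.project = nn.Conv3d(contour_channels, encoder_channels, 1)  # match channels

    def forward(self, contour_mid: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        # Down-sample the intermediate decoder features to the encoder output scale.
        down = F.interpolate(contour_mid, size=encoder_out.shape[2:],
                             mode="trilinear", align_corners=False)
        weight = torch.sigmoid(self.project(down))                       # weight map
        # One reading of "added with and multiplied by the encoder output".
        return encoder_out * (1.0 + weight)
```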
As an alternative implementation manner, in the embodiment of the present invention, the first up-sampling result of the decoding block at each scale has its channel number adjusted by convolution and is then added element-wise to the up-sampling result at the larger scale, serving as deep supervision of the model.
In this embodiment, as shown in the Decode Block in fig. 3, the first up-sampling result in each up-sampling structure of the decoding block has its channel number adjusted by a 1x1 convolution and is added element-wise to the up-sampling result at the larger scale, realizing deep supervision of the model to be trained.
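The deep supervision can be sketched as follows; the interpolation used to align scales and the channel counts are illustrative assumptions.

```python
# Hedged sketch of the deep supervision: the first up-sampling result at a given
# scale has its channel count adjusted by a 1x1 convolution and is added
# element-wise to the up-sampling result at the next larger scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervision(nn.Module):
    def __init__(self, small_channels: int, large_channels: int):
        super().__init__()
        self.adjust = nn.Conv3d(small_channels, large_channels, 1)   # 1x1 channel adjustment

    def forward(self, small_scale: torch.Tensor, large_scale: torch.Tensor) -> torch.Tensor:
        x = self.adjust(small_scale)
        x = F.interpolate(x, size=large_scale.shape[2:],
                          mode="trilinear", align_corners=False)      # bring to larger scale
        return large_scale + x                                        # element-wise addition
```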
As an alternative implementation manner, in an embodiment of the present invention, the first loss function value is calculated using the following formula:

$$L_{CDE} = \begin{cases} L_{CE}, & K \le 1 \\ \dfrac{1}{K}\sum_{a=1}^{K} L_{CE}\big(p(x_b),\, q(x_{ab})\big), & K > 1 \end{cases}$$

where a indexes the a-th connected domain, K is the number of connected domains, p(x_b) is the true value input to the contour detection branch, b is the class of the true value, foreground (also called positive) 1 or background (also called negative) 0 (in other words, b equal to 1 represents the second object and b equal to 0 represents a non-second object), q(x_ab) is the predicted value of the contour detection branch, L_CE is the cross-entropy loss function, and L_CDE is the loss value of the contour detection branch.
Specifically:

$$q(X_{ab}) = \begin{cases} X_{ab}, & K \ge 1 \text{ and } g(X+Y) \ge th \\ \delta, & K \ge 1 \text{ and } g(X+Y) < th \end{cases}$$

where X is the prediction result matrix of the contour detection branch, Y is the true value, X_ab denotes the values of the prediction result matrix at the position of the a-th connected domain, g(X+Y) is the result of adding the corresponding elements of X and Y, th is the preset threshold on the prediction result of the contour detection branch, and δ is a minimum value.
q(X_ab) takes the values at the position of the a-th connected domain in the prediction result matrix of the contour detection branch when the number of connected domains is greater than or equal to 1 and the element-wise sum of X and Y is greater than or equal to the prediction threshold, and q(X_ab) takes the minimum value δ when the number of connected domains is greater than or equal to 1 and the element-wise sum of X and Y is less than the prediction threshold.
Regarding the first loss function: when the counted number of connected domains is 0 or 1, the first loss function is directly calculated as the cross-entropy loss function; when the number of connected domains is greater than 1, the prediction result and the true result are added element-wise, each connected domain of the summed result is taken out and its cross entropy with the true value is computed, and finally the resulting series of connected-domain cross-entropy values is averaged to obtain the first loss function value.
Specifically, L_CDE is an improvement based on the cross-entropy loss function L_CE:

$$L_{CE} = -\sum_{b} p(x_b)\log q(x_b)$$

where q(x_b) is the predicted value of the contour detection branch.
The connected domains are the positive regions of the prediction result in the contour detection branch, and for the data input to the contour detection branch, the number of connected domains in the prediction is mostly greater than the number of positives. b takes 1 when the prediction is positive and 0 when it is negative, where 1 is foreground and 0 is background. The cross entropy of the true and predicted values can be calculated directly using L_CE.
As an alternative implementation, in an embodiment of the present invention, the second loss function value is calculated using the following formula:

$$L_{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where X is the prediction result matrix of the instance detection branch, Y is the true value input to the instance detection branch, and L_Dice is the loss value of the instance detection branch.
The loss of the contour detection branch and the loss of the instance detection branch are calculated using the connected-domain cross-entropy first loss function L_CDE and the second loss function L_Dice respectively, and the two are weighted and summed to obtain the complete loss function of the model to be trained. The connected-domain cross-entropy loss function attends to the cross-entropy loss between all detected positive regions and the annotated regions, so the results detected by the trained model have higher sensitivity.
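A minimal PyTorch sketch of the Dice loss used as the second loss function is given below; the smoothing constant is an illustrative assumption.

```python
# Hedged sketch of the second loss function (Dice loss) of the instance branch.
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```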
Example 2
The present embodiment provides a model training apparatus, which may be used to perform the model training method in the foregoing embodiment 1, where the apparatus may be disposed inside a server or other devices, and the modules are mutually matched, so as to implement training of a model, as shown in fig. 4, and the apparatus includes:
A first acquisition module 201, configured to acquire a plurality of first images;
A processing module 202, configured to extract, for each of the first images, a first object in the first image, and obtain, as a second object instance tag, a second object instance annotated from the first image;
A second acquiring module 203, configured to acquire a contour of the first object;
A tag module 204, configured to obtain, as a second object profile tag, a profile of the second object according to the profile of the first object and the second object instance;
The detection module 205 is configured to input the first object to a model to be trained, and obtain an output of the model to be trained, where the model to be trained includes a contour detection branch and an instance detection branch, and the output includes a second object contour output and a second object instance output;
A first calculation module 206 for calculating a first loss function value based on the second object instance label and the second object instance output, respectively, and a second loss function value based on the second object contour label and the second object contour output;
A second calculation module 207 for calculating a third loss function value from the first loss function value and the second loss function value;
an adjustment module 208, configured to adjust parameters of the model to be trained based on the third loss function value.
The dual-supervision model obtained with the model training method provided in this embodiment considers that the detection target object in the image to be detected, i.e. the second object, is closely related to the first object; for example, an aneurysm is an abnormal bulge on an arterial wall. Therefore, before the dual-supervision model is used to detect the second object in an image to be detected, the first object is extracted from the image to be detected and input into the dual-supervision model, which outputs the detection result. Preprocessing the first object requires no time-consuming operations such as N4 correction, so detection efficiency is high; the first image does not need to be divided into small blocks during detection, so sensitivity is high; and operations such as image reconstruction are avoided. In addition, the dual-supervision model detects on the first object closely related to the second object rather than on the whole image content, so both efficiency and accuracy during detection are high. The two branches of the dual-supervision model respectively extract the contour features and all the features of the second object, detect possible second objects from the contour features, and further judge from all the features whether a possible second object is actually the second object, thereby reducing the false detection rate and improving the detection precision.
For a specific description of the above device portion, reference may be made to the above method embodiment, and no further description is given here.
Example 3
The present embodiment provides a computer device, as shown in fig. 5, which includes a processor 301 and a memory 302, where the processor 301 and the memory 302 may be connected by a bus or other means, and in fig. 5, the connection is exemplified by a bus.
The processor 301 may be a central processing unit (Central Processing Unit, CPU). The processor 301 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), a graphics processing unit (Graphics Processing Unit, GPU), an embedded neural-network processing unit (Neural-network Processing Unit, NPU) or other special-purpose deep learning coprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc., or a combination of the above.
The memory 302, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the model training method in the embodiments of the present invention. The processor 301 executes the various functional applications and data processing of the processor, i.e., implements the model training method in the above method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 302.
The memory 302 may also include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created by the processor 301, etc. In addition, memory 302 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 302 may optionally include memory located remotely from processor 301, such remote memory being connectable to processor 301 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 302 stores one or more modules that, when executed by the processor 301, perform the model training method of the embodiment shown in fig. 1.
The details of the above computer device may be understood correspondingly with respect to the corresponding relevant descriptions and effects in the embodiment shown in fig. 1, which are not repeated here.
Embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions that can perform the model training method in any of the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, abbreviated as HDD), a solid-state drive (Solid-State Drive, SSD), or the like, and may further include a combination of the above types of memories.
It is apparent that the above embodiments are given by way of illustration only and are not limiting. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to exhaustively list all embodiments here. Obvious variations or modifications derived therefrom remain within the protection scope of the invention.
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211488759.XA CN116071296B (en) | 2022-11-25 | 2022-11-25 | Model training method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116071296A CN116071296A (en) | 2023-05-05 |
| CN116071296B true CN116071296B (en) | 2025-08-08 |
Family
ID=86182890
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211488759.XA Active CN116071296B (en) | 2022-11-25 | 2022-11-25 | Model training method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116071296B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116993680B (en) * | 2023-07-04 | 2025-06-17 | 阿里巴巴达摩院(杭州)科技有限公司 | Image processing method, image processing model training method |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109711288A (en) * | 2018-12-13 | 2019-05-03 | 西安电子科技大学 | Remote Sensing Ship Detection Method Based on Feature Pyramid and Distance Constrained FCN |
| CN115063411A (en) * | 2022-08-04 | 2022-09-16 | 湖南自兴智慧医疗科技有限公司 | Chromosome abnormal region segmentation detection method and system |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115049923A (en) * | 2022-05-30 | 2022-09-13 | 北京航空航天大学杭州创新研究院 | SAR image ship target instance segmentation training method, system and device |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116071296A (en) | 2023-05-05 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||