CN119559477B - Method for reducing multi-mode characteristic quantity of large model based on multi-level coding - Google Patents
- Publication number
- CN119559477B (application number CN202510115607.2A)
- Authority
- CN
- China
- Prior art keywords
- feature
- feature map
- attention
- local
- global
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for reducing the multi-modal feature quantity of a large model based on multi-level coding, relating to the fields of computer vision and deep learning. The method comprises: collecting high-resolution images and preprocessing them; performing feature extraction and preliminary dimension reduction on the preprocessed high-resolution images to obtain a preliminary compression feature map; constructing a multi-scale feature pyramid based on the preliminary compression feature map to generate features at different levels; applying a hierarchical attention mechanism to these features to obtain feature maps with local and global attention weights; applying a hierarchical fusion strategy to the feature maps with local and global attention weights to generate multi-level attention optimization features; and performing image reconstruction processing on the multi-level attention optimization features to obtain a reconstructed feature representation and image. The invention maintains attention to the global structure while capturing key regions of an image, solves the problems of insufficient multi-scale feature fusion and unbalanced application of local and global attention mechanisms, and achieves a more efficient feature representation.
Description
Technical Field
The invention relates to the field of computer vision and deep learning, in particular to a method for reducing multi-modal feature quantity of a large model based on multi-level coding.
Background
In the fields of computer vision and deep learning, processing high resolution images has been a hotspot of research. With the development of convolutional neural networks, particularly the rise of large-scale pre-training models, the multi-modal data processing capability is remarkably improved.
Despite the advances made in the art, two major problems remain: insufficient multi-scale feature fusion and unbalanced application of local and global attention mechanisms. In particular, most methods perform feature extraction at only a single scale and cannot fully exploit the complementarity between features at different levels.
Disclosure of Invention
The present invention has been made in view of the above-described problems occurring in the prior art.
Therefore, the invention provides a method for reducing the multi-modal feature quantity of a large model based on multi-level coding, which solves the problems of insufficient multi-scale feature fusion and unbalanced application of local and global attention mechanisms in the prior art.
In order to solve the technical problems, the invention provides the following technical scheme:
In a first aspect, the present invention provides a method for reducing the multi-modal feature quantity of a large model based on multi-level encoding, comprising:
Collecting a high-resolution image and preprocessing the high-resolution image;
performing feature extraction and preliminary dimension reduction on the preprocessed high-resolution image to obtain a preliminary compression feature map;
Constructing a multi-scale feature pyramid based on the preliminary compression feature map to generate different layers of features;
Applying a hierarchical attention mechanism to the different hierarchical features to obtain a feature map with local and global attention weights;
adopting a hierarchical fusion strategy to the feature map with the local and global attention weights to generate a multi-level attention optimization feature;
applying image reconstruction processing to the multi-level attention optimization features to acquire a reconstructed feature representation and an image;
The method comprises the steps of collecting high-resolution images and preprocessing,
Acquiring a high resolution image using a scanner;
image denoising and standardization processing are carried out on the high-resolution image;
Dividing the high-resolution image subjected to denoising and standardization into a plurality of areas, and obtaining a multi-resolution image block by applying an adaptive resolution selection algorithm;
Encoding and storing the multi-resolution image blocks, creating metadata records for each resolution image block, and generating a preprocessed high-resolution image.
As a preferred scheme of the method for reducing the multi-modal feature quantity of a large model based on multi-level coding, performing feature extraction and preliminary dimension reduction on the preprocessed high-resolution image to obtain a preliminary compression feature map comprises the following specific steps:
Adjusting the preprocessed high-resolution image into a uniform size, and performing feature extraction by using a deep convolutional neural network to obtain a multi-layer feature map;
Applying local response normalization to the multi-layer feature map, and carrying out preliminary dimension reduction using PCA (principal component analysis);
further compressing the PCA-reduced feature map with a lightweight encoder to obtain the preliminary compression feature map.
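The PCA-based preliminary dimension reduction described above can be sketched in NumPy; this is a minimal SVD-based version, and the sizes and names (`pca_reduce`) are illustrative rather than taken from the patent:

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project flattened feature vectors onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions,
    # sorted by decreasing explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 256))   # 64 feature vectors of dimension 256
reduced = pca_reduce(feats, 32)      # compressed to 32 dimensions
```

Because the components are ordered by singular value, the first output dimension carries at least as much variance as the last, which is what makes truncation a reasonable compression step.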
As a preferred scheme of the method for reducing the multi-modal feature quantity of a large model based on multi-level coding, constructing a multi-scale feature pyramid based on the preliminary compression feature map to generate features at different levels comprises the following specific steps:
Unifying the sizes of the preliminary compressed feature graphs, and creating a base layer of a multi-scale feature pyramid based on the compressed feature graphs after unifying the sizes;
Upsampling the base layer feature map using a transpose convolution to construct a high resolution feature map;
Downsampling the base layer feature map by using maximum pooling to construct a low-resolution feature map;
Applying 1x1 convolution to the high-resolution feature map and the low-resolution feature map to adjust the number of channels, adding the feature maps after adjusting the number of channels according to elements, and obtaining a fused multi-scale feature map;
up-sampling the fused multi-scale feature map, adding the multi-scale feature map with corresponding position features of the higher-resolution feature map element by element, and eliminating an aliasing effect by applying a 3x3 convolution layer to generate the multi-scale fused high-resolution feature map;
gradually constructing a complete multi-scale feature pyramid by taking the multi-scale fused high-resolution feature map output at the current stage as the input of the next stage;
through the multi-scale feature pyramid, feature maps of different resolutions can be extracted and fused at each level, yielding features at different levels, from low-level details to high-level semantics.
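The channel alignment and element-wise addition used in the pyramid fusion step above can be sketched in NumPy. A 1x1 convolution is simply a per-pixel linear map over channels; the shapes below are illustrative, and the low-resolution map is assumed to have already been upsampled to the same spatial size:

```python
import numpy as np

def conv1x1(x, w):
    # x: (H, W, C_in), w: (C_in, C_out).
    # A 1x1 convolution mixes channels independently at every pixel,
    # which in NumPy is just a matrix product on the last axis.
    return x @ w

rng = np.random.default_rng(1)
hi = rng.normal(size=(8, 8, 16))    # high-resolution branch, 16 channels
lo = rng.normal(size=(8, 8, 32))    # low-resolution branch (upsampled), 32 channels
w_hi = rng.normal(size=(16, 24))    # project both branches to 24 channels
w_lo = rng.normal(size=(32, 24))
fused = conv1x1(hi, w_hi) + conv1x1(lo, w_lo)  # element-wise sum after alignment
```

The 1x1 projection changes the feature-map depth without touching its spatial dimensions, which is why the two branches can be added element by element afterwards.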
As a preferred scheme of the method for reducing the multi-modal feature quantity of a large model based on multi-level coding, applying a hierarchical attention mechanism to the features at different levels to acquire feature maps with local and global attention weights comprises the following specific steps:
Based on different layers of characteristics, a local attention mechanism is designed, importance weights of local areas are extracted by using a convolutional neural network, and a characteristic diagram with the local attention weights is obtained;
Based on different levels of characteristics, a global attention mechanism is designed, a global dependency relationship is captured by using a self-attention mechanism, and a characteristic diagram of global attention weight is obtained;
The feature map of the local attention weight and the feature map of the global attention weight are fused in a weighted summation mode, and the feature map with the local attention weight and the global attention weight is generated.
As a preferred scheme of the method for reducing the multi-modal feature quantity of a large model based on multi-level coding, applying a hierarchical fusion strategy to the feature maps with local and global attention weights to generate multi-level attention optimization features comprises the following specific steps:
Integrating the feature graphs of local and global attention weights through a hierarchical fusion strategy, defining a function for enhancing the expression of local features, and adjusting the importance of the local features through convolution operation and nonlinear transformation;
Defining a function for improving the quality of the global features, and adjusting the importance of the global features through a SENet squeeze-and-excitation network and GAP (global average pooling);
Defining a function for calculating the cross relation between the local feature map and the global feature map, and calculating the cross relation between the local feature map and the global feature map through a multi-head attention mechanism and relative position codes;
Defining a dynamic adjustment function of the overall feature combination, and adjusting weights of the local feature map and the global feature map by using weighted summation;
introducing an exponential decay coefficient, and integrating all the features obtained in the above steps into a multi-level attention optimization feature map, with the expression:

F_opt = a·sin(ω1·F_L + φ1) + b·cos(ω2·F_G + φ2) + α·F_G + β·F_L + γ·e^(−λ·k)·(F_L ⊙ F_G);

wherein F_opt represents the multi-level attention optimization feature map, F_L represents the feature map of local attention weights, F_G represents the feature map of global attention weights, a is the amplitude coefficient of the sinusoidal function, b is the amplitude coefficient of the cosine function, ω1 is the frequency used to adjust the sinusoidal function of F_L, ω2 is the frequency used to adjust the cosine function of F_G, φ1 is the initial phase of the sinusoidal function, φ2 is the initial phase of the cosine function, α is the weight coefficient of the influence of the global features on the local features, β is the weight coefficient of the influence of the local features on the global features, γ is the intensity coefficient of the exponential decay term, λ is the speed parameter of the exponential decay, k is the fusion stage index, and ⊙ denotes element-wise multiplication.
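A minimal NumPy sketch of this fusion step follows. The published formula renders its symbols as images, so the parameter names here are assigned for illustration, and the assumption that the exponential decay weights an element-wise local-global cross term is one plausible reading, not a confirmed detail of the patent:

```python
import numpy as np

def fuse(f_local, f_global, a=1.0, b=1.0, w1=0.5, w2=0.5, phi1=0.0, phi2=0.0,
         alpha=0.3, beta=0.3, gamma=0.1, lam=0.05, k=0):
    """Sinusoidally modulated fusion of local/global attention feature maps,
    with an exponentially decaying cross term (illustrative reconstruction)."""
    modulated = a * np.sin(w1 * f_local + phi1) + b * np.cos(w2 * f_global + phi2)
    cross = alpha * f_global + beta * f_local        # mutual-influence terms
    decay = gamma * np.exp(-lam * k)                 # fades with fusion stage k
    return modulated + cross + decay * (f_local * f_global)

rng = np.random.default_rng(1)
f_l = rng.normal(size=(8, 8, 16))   # local-attention feature map
f_g = rng.normal(size=(8, 8, 16))   # global-attention feature map
fused = fuse(f_l, f_g, k=2)
```

Setting gamma to zero removes the decaying cross term entirely, so the stage index k then has no effect on the output.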
As a preferred scheme of the method for reducing the multi-modal feature quantity of a large model based on multi-level coding, applying image reconstruction processing to the multi-level attention optimization features to obtain the reconstructed feature representation and image comprises the following specific steps:
Performing standardization processing on the multi-level attention optimizing feature map, and increasing the space size by using a transposed convolution layer to obtain a processed feature map;
extracting an encoder and a decoder from the pre-trained deep convolutional neural network, introducing jump connection, and splicing the characteristics of the encoder stage hierarchy and the decoder stage hierarchy of the processed characteristic map to obtain the characteristic map with the jump connection;
applying self-adaptive instance normalization processing to the feature map with jump connection, and adjusting the mean value and variance of the feature map;
And further upsampling the feature map subjected to the normalization processing of the self-adaptive example through a plurality of transposed convolution layers to generate a feature map which is close to the resolution of the original input image, acquiring an image with the same spatial dimension as the original input by using a tanh activation function, and integrating the upsampled and activated feature map to obtain a reconstructed feature representation and image.
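The adaptive instance normalization used in the reconstruction stage can be sketched in NumPy as follows; statistics are taken per channel over the spatial dimensions, and the scalar target mean/std here stand in for whatever learned style parameters the full model would supply:

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization: re-center and re-scale each channel
    of a (H, W, C) feature map toward a target mean and standard deviation."""
    mean = content.mean(axis=(0, 1), keepdims=True)  # per-channel spatial mean
    std = content.std(axis=(0, 1), keepdims=True)    # per-channel spatial std
    return style_std * (content - mean) / (std + eps) + style_mean

rng = np.random.default_rng(2)
fmap = 3.0 * rng.normal(size=(16, 16, 8)) + 1.0
out = adain(fmap, style_mean=2.0, style_std=0.5)
```

After the call, every channel of `out` has mean 2.0 and standard deviation close to 0.5, regardless of the input statistics.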
In a second aspect, the present invention provides a system for reducing large model multi-modal feature based on multi-level encoding, comprising,
The preprocessing module is used for acquiring an original high-resolution image and preprocessing the original high-resolution image;
The dimension reduction module is used for carrying out feature extraction and preliminary dimension reduction on the preprocessed original high-resolution image to obtain a preliminary compression feature map;
Constructing a pyramid module, and constructing a multi-scale feature pyramid based on the preliminary compression feature map to obtain features of different layers;
the attention enhancement module applies a layered attention mechanism to the different layers of features to obtain a feature map with local and global attention weights;
The feature fusion module is used for obtaining multi-level attention optimization features by adopting a hierarchical fusion strategy for the feature map with local and global attention weights;
And the image reconstruction module is used for carrying out image reconstruction processing on the multi-level attention optimization features to obtain a reconstructed feature representation and an image.
In a third aspect, the invention provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, implements the steps of the method for reducing the multi-modal feature quantity of a large model based on multi-level encoding according to the first aspect of the invention.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for reducing the multi-modal feature quantity of a large model based on multi-level encoding according to the first aspect of the present invention.
The beneficial effects of the invention are as follows. By constructing a multi-scale feature pyramid and applying a hierarchical attention mechanism, the method achieves effective fusion of features at different levels, from low-level details to high-level semantics. It not only significantly reduces the feature quantity and computational complexity, but also improves the model's ability to understand complex scenes. By jointly optimizing the local and global attention mechanisms, the invention captures the key regions in an image while maintaining attention to the global structure, thereby solving the prior-art problems of insufficient multi-scale feature fusion and unbalanced application of local and global attention mechanisms, and achieving a more efficient feature representation and better task performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for reducing large-model multi-modal feature based on multi-level encoding in embodiment 1;
fig. 2 is a schematic diagram of the hierarchical attention mechanism in embodiment 1.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in ways other than those described herein; persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Embodiment 1, referring to fig. 1 and 2, is a first embodiment of the present invention, and this embodiment provides a method for reducing multi-modal feature of a large model based on multi-level coding, including the following steps:
S1, collecting a high-resolution image, specifically comprising the following steps of,
Configuring a dedicated OCR (optical character recognition) scanner, and setting scanning parameters including DPI resolution, grayscale, and the RGB and CMYK color modes;
The original document or object is placed on the scanner, a scanning procedure is started, and a high resolution image is acquired.
S2, preprocessing the high-resolution image, specifically comprising the following steps,
Selecting a proper denoising algorithm according to image characteristics, including Gaussian filtering, bilateral filtering, NLM non-local mean filtering and FFT fast Fourier transform denoising, applying the selected denoising algorithm to the scanned image, removing random noise in the scanned image, improving the image quality and simultaneously keeping the key details and edge definition of the image;
Carrying out standardization processing on the denoised high-resolution image. The brightness and contrast of the image are adjusted through histogram equalization or CLAHE (contrast-limited adaptive histogram equalization), improving the visual quality of the image and enhancing the visibility of image features. Size normalization is then performed, adjusting all images to a uniform size to ensure consistency in subsequent processing. Finally, color space conversion is carried out, converting the image from one color space to another to simplify the processing steps and make the features more salient while reducing computational complexity;
the method comprises the steps that a denoising and standardization high-resolution image is divided into a plurality of non-overlapping subareas by adopting a regular grid division and super-pixel segmentation method according to the complexity of the image content, so that the structural information in the image is captured better, the possibility of independent processing is provided for each subarea, and the processing efficiency and the result precision are optimized;
evaluating the level of detail and the information content of each region using edge detection and texture analysis, and selecting an appropriate resolution for each region according to the evaluation result, so that important regions retain sufficient detail while the data volume of unimportant regions is reduced; resampling each sub-region at the selected resolution to generate multi-resolution image blocks, enabling efficient storage and fast transmission while improving the overall visual quality of the image;
Selecting a JPEG compression algorithm according to the quality and storage requirement of the multi-resolution image blocks, and encoding each multi-resolution image block to reduce the occupation of storage space;
defining a metadata structure, determining metadata fields to be recorded, creating corresponding metadata records for each image block, ensuring the relevance between the image blocks and related information thereof, providing convenience for image retrieval and management, and supporting long-term storage and future use of image data;
all the coded image blocks and metadata records thereof are integrated together to form a final preprocessed high-resolution image, so that the integrity and consistency of image data are ensured, and further analysis is facilitated.
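The brightness/contrast standardization step above can be sketched as follows. This minimal NumPy version implements plain global histogram equalization rather than CLAHE, which additionally operates on local tiles with a clip limit; the function name is illustrative:

```python
import numpy as np

def equalize_hist(img):
    """Global histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]        # CDF value of the darkest occupied level
    # Map each gray level through the normalized CDF to spread the histogram
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255.0),
                  0, 255).astype(np.uint8)
    return lut[img]

rng = np.random.default_rng(3)
# A low-contrast image: all pixel values squeezed into the range [100, 150]
low_contrast = rng.integers(100, 151, size=(32, 32)).astype(np.uint8)
stretched = equalize_hist(low_contrast)
```

After equalization the darkest occupied gray level maps to 0 and the brightest to 255, so the narrow input range is stretched over the full 8-bit dynamic range.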
S3, carrying out feature extraction and preliminary dimension reduction on the preprocessed high-resolution image to obtain a preliminary compression feature map, wherein the specific steps are as follows,
The size of the preprocessed high-resolution image is adjusted through bilinear interpolation, so that all images are ensured to have the same width and height, and subsequent processing is facilitated;
Extracting features from the resized high-resolution image using a deep convolutional neural network, with the following specific steps:
transmitting the high-resolution image with the adjusted size as input to a deep convolutional neural network, transmitting image data layer by layer through each layer in the network, sliding the convolutional layer on the input image by using a plurality of filters to generate a feature map, and capturing visual features of edges, textures and shape types by each filter;
The ReLU nonlinear activation function is applied after each convolution layer, the nonlinear characteristic is introduced, the complex mapping relation can be reflected, the pooling operation is carried out after the convolution layers are activated, the space dimension of the feature map can be reduced, the overfitting is prevented, and the calculated amount is reduced while the important features are maintained;
According to the steps, as the number of layers increases, the early layer mainly captures low-level features of colors and edges, the later layer gradually turns to capture high-level semantic information of object components and categories, and feature maps of different layers reflect different abstraction levels of images;
using skip connections, as in the ResNet architecture, to add the input of each layer directly to its output, promoting information flow and alleviating the vanishing-gradient problem;
After a series of operations such as convolution, activation, pooling and the like, a multi-layer feature map is generated, wherein the multi-layer feature map contains rich space and semantic information and can be directly used for subsequent tasks;
Performing dimension reduction on the multi-layer feature map using LRN (local response normalization) and PCA (principal component analysis): LRN enhances the larger activation values in the feature map while suppressing the smaller ones, achieving dimension reduction and improving the generalization capability of the model; PCA reduces the feature quantity by finding the directions of maximum variance in the data and projecting it into a lower-dimensional space, while preserving as much of the original information as possible;
The feature map of preliminary dimension reduction is further compressed by using a lightweight encoder, and the specific steps are as follows:
Gradually reducing the space size of the feature map by adopting convolution or maximum pooling operation with the step length of 2 in a downsampling mode, so that the resolution is reduced;
nonlinear characteristics are introduced through BN batch normalization and activation functions, so that a deep convolutional neural network is helped to learn and express complex modes better;
through multi-scale feature fusion and combining feature graphs of different layers, the jump connection ensures that both low-level details and high-level semantic information can be effectively reserved;
Compressing the feature map of each channel into a single numerical value through global averaging pooling to obtain a feature vector with a fixed length, further reducing feature dimensions and simultaneously maintaining important global information;
setting one or more bottleneck layers at the deepest part of the encoder; these layers have fewer neurons or channels and forcibly compress the feature representation;
after the processing, one or more groups of feature graphs after preliminary compression are generated, and the size and the calculation burden are greatly reduced on the premise of retaining key features, so that the method is suitable for subsequent task processing.
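The global average pooling and bottleneck compression described above can be sketched in NumPy; the sizes and the ReLU-plus-linear bottleneck are illustrative stand-ins for the encoder's learned layers:

```python
import numpy as np

def global_average_pool(fmap):
    # (H, W, C) -> (C,): compress each channel to a single scalar
    return fmap.mean(axis=(0, 1))

def bottleneck(vec, w):
    # A linear layer with fewer outputs than inputs forces a compact code;
    # ReLU supplies the nonlinearity mentioned in the text
    return np.maximum(w @ vec, 0.0)

rng = np.random.default_rng(2)
fmap = rng.normal(size=(16, 16, 64))             # 64-channel feature map
code = bottleneck(global_average_pool(fmap),     # 64 -> 8 dimensional code
                  rng.normal(size=(8, 64)))
```

The pooled vector has a fixed length regardless of the spatial size of the input, which is what makes the subsequent bottleneck independent of image resolution.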
S4, constructing a multi-scale feature pyramid based on the preliminary compression feature map to generate different layers of features, wherein the specific steps are as follows,
Unifying the dimension of the feature map after preliminary compression by a bilinear interpolation method, and directly using the feature map after unifying the dimension as a pyramid base layer;
up-sampling the base layer feature map by using transposed convolution, and setting parameters of the transposed convolution including kernel size, stride, filling and output channel number according to requirements to obtain a high-resolution feature map;
Downsampling the base layer feature map by using maximum pooling, and setting parameters of the maximum pooling, including pooling window size and stride, according to requirements to obtain a low-resolution feature map;
The method comprises the steps of respectively applying 1x1 convolution to a high-resolution feature map and a low-resolution feature map to adjust the number of channels, ensuring that the high-resolution feature map and the low-resolution feature map have the same number of channels, achieving the purpose of changing the depth of the feature map without changing the spatial dimension of the feature map, and adding the feature maps with the number of channels adjusted according to elements to obtain a fused multi-scale feature map;
Up-sampling the fused multi-scale feature map, and adding the up-sampled multi-scale feature map and the features from the corresponding positions of the shallower high-resolution feature map element by element, namely adding each channel value of the two feature maps on the same spatial position, so that the low-level detail information and the high-level semantic information can be effectively combined without increasing the calculation cost;
Applying a 3x3 convolution layer to the upsampled and element-wise-added multi-scale feature map to smooth the noise possibly introduced by upsampling, improve the quality of the final output, and eliminate the checkerboard artifacts that may occur during upsampling, obtaining a multi-scale fused high-resolution feature map;
the multiscale fusion high-resolution feature map is used as the output of the current stage and is used as the input of the next stage, so that the information of each layer can be fully utilized, and the more and more abstract features can be captured along with the increase of the pyramid layer number;
Gradually constructing a complete multi-scale feature pyramid according to the principle from coarse to fine, firstly processing a feature map with lower resolution, and then gradually increasing the resolution until the required layer number is reached or the requirement of a specific task is met, so that information of all scales is ensured to be considered;
And extracting feature graphs with different resolutions from the multi-scale feature pyramid by using a bottom-up path and a top-down path, and fusing the extracted feature graphs with different resolutions by adopting a transverse connection method to obtain features with different levels from low-level details to high-level semantics.
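The top-down, coarse-to-fine merging described above can be sketched in NumPy. Nearest-neighbour upsampling stands in here for the transposed convolution, and the lateral maps are assumed to already share a channel count (in the full method, 1x1 convolutions guarantee this):

```python
import numpy as np

def upsample2x(fmap):
    # Nearest-neighbour 2x upsampling (stand-in for transposed convolution)
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def top_down_merge(pyramid):
    """Merge a pyramid coarse-to-fine: repeatedly upsample the running map
    and add the lateral (same-resolution) feature map element-wise."""
    merged = [pyramid[-1]]                       # start from the coarsest level
    for lateral in reversed(pyramid[:-1]):
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]                          # return finest level first

rng = np.random.default_rng(3)
pyr = [rng.normal(size=(16, 16, 8)),             # fine level
       rng.normal(size=(8, 8, 8)),               # middle level
       rng.normal(size=(4, 4, 8))]               # coarse level
out = top_down_merge(pyr)
```

Each output level thus mixes its own lateral detail with the accumulated semantics of every coarser level above it.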
S5, applying a hierarchical attention mechanism to the different layers of features to acquire a feature map with local and global attention weights, wherein the specific steps are as follows,
Design a local attention mechanism: capture the spatial structure information of the different levels of features with a convolutional neural network, obtain local weights through a Sigmoid function, which limits the weights to a fixed range and guarantees that the generated local-area importance weights are non-negative, and use the local weights directly as weighting coefficients;
Apply the local-area importance weights to the original feature map through element-wise multiplication to obtain the feature map with local attention weights;
Design a global attention mechanism to capture long-range dependencies among the different levels of features: introduce a self-attention mechanism and project the feature map into three different vectors, namely query, key and value;
Compute the similarity between the query and the key to obtain a score matrix, normalize the score matrix with a Softmax function to form the attention distribution, and weight-sum the values according to this distribution to obtain the feature map with global attention weights;
Combine the local attention feature map and the global attention feature map: mix the two feature maps in a certain proportion to create a new feature map that reflects both local and global characteristics, dynamically adjusting the proportion each map occupies during creation, so that whether local or global information is more important is determined automatically by the demands of the task.
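The local (Sigmoid-gated) branch, the global (self-attention) branch, and their proportional mixing can be sketched as below. Feature maps are assumed flattened to positions x channels, and the local branch's convolution is simplified to a single linear projection, so this is an illustrative reduction, not the method itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(feat, w_proj):
    # Local branch: project each position to a scalar, squash it with
    # Sigmoid so the importance weight is non-negative, then apply it
    # to the original features by element-wise multiplication.
    weights = sigmoid(feat @ w_proj)           # (positions, 1)
    return feat * weights

def global_attention(feat, wq, wk, wv):
    # Global branch: single-head self-attention over all positions.
    q, k, v = feat @ wq, feat @ wk, feat @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return attn @ v                            # weighted sum of values

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8))            # 16 positions, 8 channels
local = local_attention(feat, rng.standard_normal((8, 1)))
glob = global_attention(feat, *(rng.standard_normal((8, 8)) for _ in range(3)))
alpha = 0.5                                    # mixing ratio; dynamic in practice
mixed = alpha * local + (1 - alpha) * glob
```

The fixed `alpha` here replaces the dynamically adjusted proportion the text describes.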
S6, adopting a hierarchical fusion strategy to the feature map with local and global attention weights to generate a multi-level attention optimization feature, specifically comprising the following steps,
Integrate the feature maps of local and global attention weights through a hierarchical strategy, comprising:
Define a function for enhancing local feature expression: apply convolution operations to the local attention weights of the feature map to capture spatial structure information, implemented through one or more convolution layers; a non-linear transformation may be introduced after each convolution layer via an activation function, and the Sigmoid function of this non-linear transformation further adjusts the importance of the local features; the output of this step is the enhanced local feature expression;
Define a function for improving global feature quality: adjust the importance of the global features with a SENet squeeze-and-excitation network and GAP global average pooling; perform global average pooling on the global attention weights of the feature map to obtain a mean value for each channel, construct an 'excitation' branch from a fully connected layer and a ReLU activation function to generate a weight for each channel, and rescale the corresponding channels of the original feature map with these weights; the output of this step is the improved global feature quality;
Define a function for computing the cross-relation between the local and global feature maps: using a multi-head attention mechanism, extract the query, key and value from the local attention weights and the global attention weights respectively; compute the similarity between the query and the key to obtain a score matrix, normalize it with a Softmax function to form the attention distribution, and weight-sum the values according to this distribution to obtain information on the cross-relation; the output of this step is a new feature representing the cross-relation between the local and global feature maps;
Define a dynamic adjustment function for the overall feature combination: adjust the weights of the feature map with local attention weights and the feature map with global attention weights by weighted summation, dynamically tuning the proportion each map occupies; the output of this step is the dynamically adjusted overall feature;
introducing an exponential decay coefficient, integrating all the characteristics obtained in the steps into a multi-level attention optimizing characteristic diagram, wherein the expression is as follows:
F = A₁·sin(ω₁·F_L + φ₁) + A₂·cos(ω₂·F_G + φ₂) + λ₁·F_G + λ₂·F_L + γ·e^(−δ·t);
wherein F represents the multi-level attention optimizing feature map, F_L represents the feature map with local attention weights, F_G represents the feature map with global attention weights, A₁ is the amplitude coefficient of the sinusoidal function, A₂ is the amplitude coefficient of the cosine function, ω₁ is the frequency used to adjust the sinusoidal function of F_L, ω₂ is the frequency used to adjust the cosine function of F_G, φ₁ is the initial phase of the sinusoidal term, φ₂ is the initial phase of the cosine term, λ₁ is the weight coefficient of the influence of the global feature on the local feature, λ₂ is the weight coefficient of the influence of the local feature on the global feature, γ is the intensity coefficient of the exponential decay term, and δ is the speed parameter of the exponential decay;
The output of this process is the multi-level attention optimizing feature map. Local and global features are comprehensively considered, and their dynamic balance is realized through the mathematical formula; this design not only improves the algorithm's understanding of complex scenes but also effectively reduces the number of large-model multi-modal features, thereby lowering the computational cost and improving performance.
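The SENet-style squeeze-and-excitation adjustment used to improve global feature quality can be illustrated as follows; the bottleneck width and random weights are illustrative assumptions, not values from the patent.

```python
import numpy as np

def se_gate(feat, w1, w2):
    # "Squeeze": global average pooling gives one mean per channel.
    pooled = feat.mean(axis=(0, 1))                  # (C,)
    # "Excitation": bottleneck FC -> ReLU -> FC -> Sigmoid per-channel gate.
    hidden = np.maximum(pooled @ w1, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # values in (0, 1)
    # Rescale each channel of the original feature map by its gate.
    return feat * gate

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 8, 16))               # (H, W, C) feature map
out = se_gate(feat, rng.standard_normal((16, 4)), rng.standard_normal((4, 16)))
```

Because every gate lies in (0, 1), the operation can only attenuate channels, which is the sense in which it "rescales the corresponding channels" of the original map.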
S7, using image reconstruction processing to the multi-level attention optimization feature to obtain a reconstructed feature representation and image, specifically comprising the following steps,
Standardize the multi-level attention optimization features with batch normalization and enlarge the spatial dimensions of the feature map with a transposed convolution layer, which maps low-resolution features back to a high-resolution space through its filters; an activation function added after the transposed convolution layer helps capture more complex patterns, yielding the processed feature map;
Capture high-level abstract features with the encoder part of a deep convolutional neural network and convert the features back to the same representation as the original input with the decoder part; to retain more detail information, introduce skip connections (SkipConnections) between the encoder and decoder that pass low-level features directly to the decoder layer of the corresponding level, which effectively mitigates the vanishing-gradient problem in deep networks and promotes the retention of fine-grained information; concatenate the features obtained in the encoder stage with those obtained in the decoder stage to form the feature map with skip connections;
Apply adaptive instance normalization to the feature map with skip connections, which allows the mean and variance of the feature map to be adjusted dynamically to better match the target distribution; by computing the mean and standard deviation independently for each channel and rescaling them to target values, the feature representations of different samples become more consistent, improving the robustness and generalization ability of the model;
Feed the adaptively instance-normalized feature map through several transposed convolution layers for up-sampling until a feature map with the same resolution as the original input image is generated, and bound the output with a tanh activation function so that the colour values of the generated image stay within a reasonable range; tanh not only limits the output range but also provides good non-linearity, helping to improve visual quality;
Integrate the feature maps obtained in all the above steps to form the final reconstructed feature representation and image; this process ensures continuity and consistency from the multi-level attention optimization features to the complete image reconstruction.
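Adaptive instance normalization, as used in the reconstruction stage, amounts to standardizing each channel and rescaling it to target statistics. A small NumPy sketch, with a channel-last layout assumed for illustration:

```python
import numpy as np

def adain(content, target_mean, target_std, eps=1e-5):
    # Per-channel mean and standard deviation of the content features...
    mean = content.mean(axis=(0, 1), keepdims=True)
    std = content.std(axis=(0, 1), keepdims=True)
    # ...standardize each channel, then rescale to the target statistics.
    return (content - mean) / (std + eps) * target_std + target_mean

rng = np.random.default_rng(2)
feat = rng.standard_normal((8, 8, 4)) * 3.0 + 1.0   # deliberately skewed stats
out = adain(feat, target_mean=0.0, target_std=1.0)  # re-centred per channel
```

In style-transfer style usage, `target_mean` and `target_std` would themselves be per-channel statistics taken from another feature map rather than scalars.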
The embodiment also provides a system for reducing the multi-modal characteristic quantity of the large model based on multi-level coding, which comprises a preprocessing module, a dimension reduction module, a pyramid building module, an attention enhancement module, a characteristic fusion module and an image reconstruction module;
the preprocessing module is used for acquiring and preprocessing the high-resolution image;
the dimension reduction module is used for carrying out feature extraction and preliminary dimension reduction on the preprocessed high-resolution image to obtain a preliminary compression feature map;
the pyramid construction module is used for constructing a multi-scale feature pyramid based on the preliminary compression feature map and generating different levels of features;
the attention enhancement module applies a layered attention mechanism to the different layers of features to acquire a feature map with local and global attention weights;
the feature fusion module is used for generating multi-level attention optimization features by adopting a hierarchical fusion strategy for the feature map with local and global attention weights;
And the image reconstruction module is used for using the multi-level attention optimization features to perform image reconstruction processing to obtain a reconstructed feature representation and an image.
The embodiment also provides a computer device, which is suitable for the situation of the method for reducing the large-model multi-mode characteristic quantity based on the multi-level coding, and comprises a memory and a processor, wherein the memory is used for storing computer executable instructions, and the processor is used for executing the computer executable instructions to realize the method for reducing the large-model multi-mode characteristic quantity based on the multi-level coding, which is provided by the embodiment.
The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
The present embodiment also provides a storage medium having a computer program stored thereon which, when executed by a processor, implements the method for reducing multi-modal feature quantity of a large model based on multi-level encoding proposed in the above embodiments; the storage medium may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
In summary, the invention realizes the effective fusion of different levels of features from low level details to high level semantics by constructing a multi-scale feature pyramid and applying a hierarchical attention mechanism. The method not only can remarkably reduce the characteristic quantity and the calculation complexity, but also can improve the understanding capability of the model to complex scenes. By introducing the joint optimization of the local and global attention mechanisms, the invention can capture the key region in the image and keep the attention to the global structure, thereby solving the problems of insufficient multi-scale feature fusion and unbalanced application of the local and global attention mechanisms in the prior art and realizing more efficient feature representation and better task performance.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.
Claims (7)
1. A method for reducing multi-modal characteristic quantity of a large model based on multi-level coding is characterized by comprising the following steps of,
Collecting a high-resolution image and preprocessing the high-resolution image;
performing feature extraction and preliminary dimension reduction on the preprocessed high-resolution image to obtain a preliminary compression feature map;
Constructing a multi-scale feature pyramid based on the preliminary compression feature map to generate different layers of features;
Applying a hierarchical attention mechanism to the different hierarchical features to obtain a feature map with local and global attention weights;
adopting a hierarchical fusion strategy to the feature map with the local and global attention weights to generate a multi-level attention optimization feature;
using image reconstruction processing to the multi-level attention optimization features to acquire a reconstructed feature representation and an image;
The method comprises the steps of collecting high-resolution images and preprocessing, wherein the specific steps are as follows,
Acquiring a high resolution image using a scanner;
image denoising and standardization processing are carried out on the high-resolution image;
Dividing the high-resolution image subjected to denoising and standardization into a plurality of areas, and obtaining a multi-resolution image block by applying an adaptive resolution selection algorithm;
Encoding and storing the multi-resolution image blocks, creating metadata records for each resolution image block, and generating a preprocessed high-resolution image;
the hierarchical fusion strategy is adopted for the feature maps with local and global attention weights to generate the multi-level attention optimization feature map, and the specific steps are as follows,
Integrating the feature maps of local and global attention weights through a hierarchical fusion strategy, defining a function for enhancing local feature expression, and adjusting the importance of local features through convolution operations and non-linear transformation;
Defining a function for improving global feature quality, and adjusting the importance of global features through a SENet squeeze-and-excitation network and GAP global average pooling;
Defining a function for calculating the cross-relation between the local feature map and the global feature map, and computing this cross-relation through a multi-head attention mechanism and relative position encoding;
Defining a dynamic adjustment function for the overall feature combination, and adjusting the weights of the local feature map and the global feature map by weighted summation;
introducing an exponential decay coefficient, integrating all the characteristics obtained in the steps into a multi-level attention optimizing characteristic diagram, wherein the expression is as follows:
F = A₁·sin(ω₁·F_L + φ₁) + A₂·cos(ω₂·F_G + φ₂) + λ₁·F_G + λ₂·F_L + γ·e^(−δ·t);
wherein F represents the multi-level attention optimizing feature map, F_L represents the feature map with local attention weights, F_G represents the feature map with global attention weights, A₁ is the amplitude coefficient of the sinusoidal function, A₂ is the amplitude coefficient of the cosine function, ω₁ is the frequency used to adjust the sinusoidal function of F_L, ω₂ is the frequency used to adjust the cosine function of F_G, φ₁ is the initial phase of the sinusoidal term, φ₂ is the initial phase of the cosine term, λ₁ is the weight coefficient of the influence of the global feature on the local feature, λ₂ is the weight coefficient of the influence of the local feature on the global feature, γ is the intensity coefficient of the exponential decay term, and δ is the speed parameter of the exponential decay.
2. The method for reducing multi-modal feature quantity of large model based on multi-level coding as set forth in claim 1, wherein the steps of extracting features and primarily reducing dimensions of the preprocessed high-resolution image, obtaining a primarily compressed feature map are as follows,
Adjusting the preprocessed high-resolution image into a uniform size, and performing feature extraction by using a deep convolutional neural network to obtain a multi-layer feature map;
Local response normalization is applied to the multilayer feature map, and primary dimension reduction is carried out by using PCA principal component analysis;
Further compressing the feature map after preliminary PCA dimension reduction using a lightweight encoder to obtain the preliminary compression feature map.
3. The method for reducing multi-modal feature quantity of large model based on multi-level coding as claimed in claim 2, wherein the steps of constructing multi-scale feature pyramid based on preliminary compression feature map to generate different levels of features are as follows,
Unifying the sizes of the preliminary compressed feature graphs, and creating a base layer feature graph of the multi-scale feature pyramid based on the compressed feature graphs after unifying the sizes;
Upsampling the base layer feature map using a transpose convolution to construct a high resolution feature map;
Downsampling the base layer feature map by using maximum pooling to construct a low-resolution feature map;
Applying 1x1 convolution to the high-resolution feature map and the low-resolution feature map to adjust the number of channels, adding the feature maps after adjusting the number of channels according to elements, and obtaining a fused multi-scale feature map;
up-sampling the fused multi-scale feature map, adding the multi-scale feature map with corresponding position features of the higher-resolution feature map element by element, and eliminating an aliasing effect by applying a 3x3 convolution layer to generate the multi-scale fused high-resolution feature map;
gradually constructing a complete multi-scale feature pyramid by taking the multi-scale fusion high-resolution feature map at the current stage as the output of the current stage as the input of the next stage;
and extracting and fusing feature graphs with different resolutions on each level through a multi-scale feature pyramid, and acquiring features of different levels from low-level details to high-level semantics.
4. The method for reducing multi-modal feature values of a large model based on multi-level encoding as claimed in claim 3, wherein the applying of hierarchical attention mechanisms to different levels of features obtains feature maps with local and global attention weights by the steps of,
Based on different layers of characteristics, a local attention mechanism is designed, importance weights of local areas are extracted by using a convolutional neural network, and a characteristic diagram with the local attention weights is obtained;
Based on different levels of characteristics, a global attention mechanism is designed, a global dependency relationship is captured by using a self-attention mechanism, and a characteristic diagram of global attention weight is obtained;
The feature map of the local attention weight and the feature map of the global attention weight are fused in a weighted summation mode, and the feature map with the local attention weight and the global attention weight is generated.
5. The method for reducing multi-modal feature values of a large model based on multi-level encoding as claimed in claim 4, wherein the multi-level attention optimizing feature map is processed by image reconstruction to obtain a reconstructed feature representation and image, comprising the steps of,
Performing standardization processing on the multi-level attention optimizing feature map, and increasing the space size by using a transposed convolution layer to obtain a processed feature map;
extracting features using the encoder and decoder of a deep convolutional neural network and introducing skip connections, and splicing the features of the encoder-stage level and the decoder-stage level of the processed feature map to obtain a feature map with skip connections;
applying self-adaptive instance normalization processing to the feature map with jump connection, and adjusting the mean value and variance of the feature map;
And upsampling the feature map subjected to the normalization processing of the self-adaptive example through a plurality of transposed convolution layers to generate a feature map with the same resolution as the original input image, acquiring an image with the same spatial dimension as the original input by using a tanh activation function, and integrating the upsampled and activated feature map to obtain a reconstructed feature representation and image.
6. A computer device comprises a memory and a processor, wherein the memory stores a computer program, and the computer program is characterized in that the processor executes the steps of the method for reducing the multi-modal feature quantity of the large model based on the multi-level coding according to any one of claims 1 to 5.
7. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor performs the steps of the method for reducing multi-modal feature of a large model based on multi-level encoding as claimed in any one of claims 1 to 5.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510115607.2A CN119559477B (en) | 2025-01-24 | 2025-01-24 | Method for reducing multi-mode characteristic quantity of large model based on multi-level coding |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510115607.2A CN119559477B (en) | 2025-01-24 | 2025-01-24 | Method for reducing multi-mode characteristic quantity of large model based on multi-level coding |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119559477A CN119559477A (en) | 2025-03-04 |
| CN119559477B true CN119559477B (en) | 2025-04-15 |
Family
ID=94745170
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510115607.2A Active CN119559477B (en) | 2025-01-24 | 2025-01-24 | Method for reducing multi-mode characteristic quantity of large model based on multi-level coding |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119559477B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119991438B (en) * | 2025-04-14 | 2025-07-18 | 西安圣瞳科技有限公司 | An image processing method and system based on artificial intelligence large model |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114724549A (en) * | 2022-06-09 | 2022-07-08 | 广州声博士声学技术有限公司 | Intelligent identification method, device, equipment and storage medium for environmental noise |
| CN115100470A (en) * | 2022-06-23 | 2022-09-23 | 苏州科技大学 | Small sample image classification system and method |
Family Cites Families (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111192200A (en) * | 2020-01-02 | 2020-05-22 | 南京邮电大学 | Image Super-Resolution Reconstruction Method Based on Residual Network with Fusion Attention Mechanism |
| US12205292B2 (en) * | 2021-07-16 | 2025-01-21 | Huawei Technologies Co., Ltd. | Methods and systems for semantic segmentation of a point cloud |
| CN114612479B (en) * | 2022-02-09 | 2023-03-24 | 苏州大学 | Medical image segmentation method and device based on global and local feature reconstruction network |
| CN115761763A (en) * | 2022-10-08 | 2023-03-07 | 华东师范大学 | A language recognition method and system |
| US12373956B2 (en) * | 2023-04-26 | 2025-07-29 | Mohamed bin Zayed University of Artificial Intelligence | System and method for 3D medical image segmentation |
| CN116758130A (en) * | 2023-06-21 | 2023-09-15 | 安徽理工大学 | A monocular depth prediction method based on multi-path feature extraction and multi-scale feature fusion |
| CN118247808A (en) * | 2024-03-22 | 2024-06-25 | 安徽工业大学 | An improved method for detecting electric vehicle riders' helmet wearing based on YOLOv5 algorithm |
| CN118333857A (en) * | 2024-04-22 | 2024-07-12 | 国网江苏省电力有限公司常州供电分公司 | Lightweight multi-scale image super-resolution reconstruction method |
| CN118864238A (en) * | 2024-07-03 | 2024-10-29 | 南通沃太新能源有限公司 | Aerial image stitching method, medium and equipment based on light enhancement and improved FIST algorithm |
| CN118898864B (en) * | 2024-07-11 | 2025-07-11 | 成都书声琅琅科技有限公司 | Facial expression recognition method for multi-granularity perception and label distribution learning |
| CN118967712A (en) * | 2024-07-24 | 2024-11-15 | 重庆南鹏人工智能科技研究院有限公司 | A 3D brain tumor segmentation model based on deformable feature aggregation |
| CN118552408B (en) * | 2024-07-26 | 2024-12-17 | 泉州装备制造研究所 | Light-weight image super-resolution reconstruction method, system, storage medium and product |
| CN119228824A (en) * | 2024-10-08 | 2024-12-31 | 重庆师范大学 | An efficient skin disease segmentation method based on multi-scale and hybrid attention mechanism |
| CN119128232B (en) * | 2024-11-08 | 2025-02-11 | 江苏南极星新能源技术股份有限公司 | Vehicle-mounted display personalized regulation and control method and system based on deep learning |
- 2025-01-24 CN CN202510115607.2A patent/CN119559477B/en active Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114724549A (en) * | 2022-06-09 | 2022-07-08 | 广州声博士声学技术有限公司 | Intelligent identification method, device, equipment and storage medium for environmental noise |
| CN115100470A (en) * | 2022-06-23 | 2022-09-23 | 苏州科技大学 | Small sample image classification system and method |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119559477A (en) | 2025-03-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11887218B2 (en) | Image optimization method, apparatus, device and storage medium | |
| CN111507333B (en) | Image correction method and device, electronic equipment and storage medium | |
| CN114565528B (en) | A remote sensing image denoising method and system based on multi-scale and attention mechanism | |
| CN119559477B (en) | Method for reducing multi-mode characteristic quantity of large model based on multi-level coding | |
| Couturier et al. | Image denoising using a deep encoder-decoder network with skip connections | |
| CN114359289B (en) | Image processing method and related device | |
| CN113066017A (en) | An image enhancement method, model training method and device | |
| Zhang et al. | An unsupervised remote sensing single-image super-resolution method based on generative adversarial network | |
| Bastanfard et al. | Toward image super-resolution based on local regression and nonlocal means | |
| CN117576483B (en) | Multisource data fusion ground object classification method based on multiscale convolution self-encoder | |
| CN113066018B (en) | Image enhancement method and related device | |
| CN118608389B (en) | Real-time dynamic super-resolution image reconstruction method and reconstruction system | |
| CN116452930A (en) | Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment | |
| CN109543685A (en) | Image, semantic dividing method, device and computer equipment | |
| Feng et al. | Guided filter‐based multi‐scale super‐resolution reconstruction | |
| CN118365879A (en) | Heterogeneous remote sensing image segmentation method based on scene perception attention | |
| CN117994133A (en) | License plate image super-resolution reconstruction model construction method and license plate image reconstruction method | |
| CN119785218A (en) | Method and system for extracting buildings from remote sensing images based on local-global features | |
| CN120112938A (en) | Image restoration method, model training method, electronic device and storage medium | |
| CN116703777A (en) | Image processing method, system, storage medium and electronic device | |
| Liu et al. | Gradient prior dilated convolution network for remote sensing image super-resolution | |
| Han | Texture image compression algorithm based on self‐organizing neural network | |
| US20250054115A1 (en) | Deep learning-based high resolution image inpainting | |
| CN117745544A (en) | Image super-resolution method of skin detector | |
| CN116798041A (en) | Image recognition method and device and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |