Disclosure of Invention
The invention aims to provide a system and a method for extracting highlight video in the automotive field, so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the invention provides the following technical scheme. The system for extracting highlight video in the automotive field comprises:
a text semantic analysis module, used for acquiring the semantic information of the whole video;
The text semantic analysis module comprises:
a general classification model, used for acquiring basic features of the video, such as edges and geometry;
a general quality scoring model, used for evaluating the quality of the video;
a general face detection and recognition model, used for detecting and recognizing the faces appearing in the video;
an automotive-field fine-grained detection model, used for identifying the shooting angle of the automobile in the image and acquiring the intermediate-layer output of the neural network corresponding to the detection box as a semantic vector.
Preferably, the text semantic analysis module obtains the video semantic information from the title and text associated with the video, the speech in the video, and the subtitles in the video; the three kinds of text information are combined, the semantics that the whole video intends to express are obtained through a trained language model, and a vector of a specified length is output.
Preferably, the general classification model, the general quality scoring model, the general face detection and recognition model, and the automotive-field fine-grained detection model are combined to form the image semantic vector of the whole video; the audio data in the video is then obtained, its feature vector is extracted, and the feature vector is combined with the image vector to obtain the semantic vector of the video.
Preferably, the face detection and recognition model must emphasize accuracy and robustness, so as to ensure that the model can accurately detect and recognize faces in different environments.
The method for extracting highlight video in the automotive field comprises the following specific steps:
Step one, video input
Firstly, receiving a video file as input;
Step two, video framing
Then, the input video is subjected to framing, that is, the video is decomposed into a series of still images;
Step three:
3.1 text content extraction
For each decomposed frame, the system extracts the text content in the frame, including subtitles, labels, or other text information appearing in the video;
3.2 extracting single-frame multimodal original information
While extracting the text content, the system collects the multimodal information and the original image and speech information for each frame; together these form a comprehensive description of each frame;
Step four:
4.1 analyzing the semantics of the whole video
4.2 extracting the multimodal semantic information
Step five, model scoring
The text semantic vector is used to guide the scoring of the video semantic vector, and the text information guides the prediction from the video semantic features through a Cross Attention structure;
Step six, outputting the highlight segments
Finally, the highlight segments are output according to the service characteristics.
Preferably, in step one, when the video is input, it must be confirmed that the received video file format is supported; when the video format cannot be identified, the video format must be converted.
Preferably, the video framing in step two requires selecting a proper framing mode, so that key information is not lost through frame segmentation, and the influence of the frame rate on subsequent processing must be considered; for example, an excessively high frame rate increases the processing difficulty and the amount of computation.
Preferably, the video-frame input to the Cross Attention structure in step five generally takes the features of 7 frames, namely the current frame together with the 3 frames before and the 3 frames after it, as input; the video semantic features serve as the main information for prediction, and the text semantic features serve as auxiliary features.
Preferably, before the final highlight segments are generated in step six, adjustment or optimization is required according to specific service characteristics or requirements; for example, some services may pay more attention to the originality of the video, while others may pay more attention to the efficiency of information transmission.
The beneficial effects of the invention are as follows:
After the video is input, it is framed, and the multimodal information in the images together with the original image and speech information is extracted. The titles and text associated with the video, as well as its speech and subtitles, are converted to obtain a text semantic vector. Through the general classification model, the general quality scoring model, the general face detection and recognition model, and the automotive-field fine-grained detection model, an image semantic vector of the whole video can be formed based on the characteristics of the automotive field. Finally, the audio data in the video is obtained, its feature vector is extracted and combined with the image vector to obtain the semantic vector of the video, and a Cross Attention structure is used so that the text semantic vector guides the prediction from the video semantic features. In this way the highlight segments in the video can be accurately predicted, improving the accuracy of highlight-segment prediction; at the same time, the highlight segments are extracted and output according to service characteristics and user preferences, achieving personalized recommendation and improving recommendation accuracy and user satisfaction.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in figs. 1 to 3, an embodiment of the present invention provides a system for extracting highlight video in the automotive field, comprising:
a text semantic analysis module, used for acquiring the semantic information of the whole video;
The text semantic analysis module comprises:
a general classification model, used for acquiring basic features of the video, such as edges and geometry;
a general quality scoring model, used for evaluating the quality of the video;
a general face detection and recognition model, used for detecting and recognizing the faces appearing in the video;
an automotive-field fine-grained detection model, used for identifying the shooting angle of the automobile in the image and acquiring the intermediate-layer output of the neural network corresponding to the detection box as a semantic vector.
The general classification model is used to preliminarily classify or identify the content in the video and to recognize basic elements in the video, such as objects and scenes, providing a basis for subsequent finer processing. The quality scoring model is used to evaluate the quality of the video frames. The face detection and recognition model detects and recognizes faces in the video in real time, providing an additional dimension for video analysis and processing. Finally, the automotive-field fine-grained detection model extracts the detection box of the automobile from the video frames, and the intermediate-layer output of the neural network corresponding to the detection box is obtained as its semantic vector. This semantic vector contains rich information about the automobile, such as the vehicle type, color, and pose, and provides an important basis for subsequent analysis and processing.
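The following is a minimal sketch of this last step, assuming a PyTorch detection model: the intermediate-layer output behind each detection box is captured with a forward hook and used as a per-car semantic vector. The torchvision Faster R-CNN and its fc7 layer are illustrative stand-ins; the source does not specify the architecture of the automotive fine-grained model.

```python
import torch
import torchvision

# Illustrative stand-in detector; the automotive fine-grained model in the
# source is not specified, so a torchvision Faster R-CNN is used here.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

captured = {}

def hook(module, inputs, output):
    # One embedding per region proposal from the box head's last FC layer.
    captured["fc7"] = output.detach()

model.roi_heads.box_head.fc7.register_forward_hook(hook)

frame = torch.rand(3, 480, 640)  # placeholder for a decoded video frame
with torch.no_grad():
    detections = model([frame])[0]  # boxes, labels, scores

# Intermediate-layer embeddings usable as per-box semantic vectors.
semantic_vectors = captured["fc7"]  # shape: (num_proposals, 1024)
```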
The text semantic analysis module obtains the video semantic information from the title and text associated with the video, the speech in the video, and the subtitles in the video; the three kinds of text information are combined, the semantics that the whole video intends to express are obtained through a trained language model, and the output result is a vector of a specified length.
Speech in the video can be converted to text by ASR (automatic speech recognition), while subtitles in the video can be converted to text by OCR (optical character recognition).
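As a minimal sketch of this text branch, assuming the ASR and OCR outputs are already available as plain strings, the three text sources can be merged and encoded into a fixed-length vector with an off-the-shelf sentence encoder; the encoder name and the example strings are illustrative, not the trained language model of the invention.

```python
from sentence_transformers import SentenceTransformer

# Illustrative text sources; in practice these come from the video metadata,
# an ASR model, and an OCR model respectively.
title_text = "2024 sedan night-road test drive"
asr_text = "the steering feels precise at high speed"
ocr_text = "0-100 km/h in 6.2 s"

combined = " ".join([title_text, asr_text, ocr_text])

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
text_semantic_vector = encoder.encode(combined)    # fixed-length vector
print(text_semantic_vector.shape)                  # e.g. (384,)
```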
The general classification model, the general quality scoring model, the general face detection and recognition model, and the automotive-field fine-grained detection model are combined to form the image semantic vector of the whole video; the audio data in the video is then obtained, its feature vector is extracted, and the feature vector is combined with the image vector to obtain the semantic vector of the video.
The semantic vector of the whole video is thus obtained from the image semantic vector of the video and the feature vector of the audio data. This semantic vector not only helps to integrate the whole content of the video, but also guides the scoring process and ultimately optimizes the output result.
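A minimal sketch of the fusion step follows. The source does not state how the audio and image vectors are combined, so simple temporal pooling and concatenation are assumed here; the shapes, file name, and MFCC features are illustrative.

```python
import numpy as np
import librosa

# Stand-in per-frame image embeddings from the combined image models.
frame_embeddings = np.random.rand(200, 1024)
image_semantic_vector = frame_embeddings.mean(axis=0)  # pool over frames

# Audio track previously extracted from the video (e.g. with ffmpeg).
y, sr = librosa.load("audio_track.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # (40, num_audio_frames)
audio_feature_vector = mfcc.mean(axis=1)            # pool over time -> (40,)

# Assumed fusion: concatenation of the two modality vectors.
video_semantic_vector = np.concatenate(
    [image_semantic_vector, audio_feature_vector]   # (1024 + 40,)
)
```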
The face detection and recognition model must emphasize accuracy and robustness, so that the model can accurately detect and recognize faces in different environments.
In practical applications, various complicated face conditions may be encountered, such as occlusion, profile faces, and illumination changes, so the accuracy and robustness of the face detection and recognition model are very important.
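As a minimal sketch of the detection half of this model, a lightweight OpenCV Haar cascade is shown below on a single decoded frame; it is only an illustrative stand-in, since the robustness requirements above (occlusion, profile faces, illumination changes) would favor a stronger CNN-based detector in practice.

```python
import cv2

# Lightweight stand-in detector; a production system would likely use a
# CNN-based face detector for better robustness.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

frame = cv2.imread("frame_0001.jpg")      # illustrative frame path
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.equalizeHist(gray)             # partially mitigates lighting changes

faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    print(f"face at x={x}, y={y}, w={w}, h={h}")
```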
The method for extracting highlight video in the automotive field comprises the following specific steps:
Step one, video input
Firstly, receiving a video file as input;
Step two, video framing
Then, the input video is subjected to framing, that is, the video is decomposed into a series of still images;
Step three:
3.1 text content extraction
For each decomposed frame, the system extracts the text content in the frame, including subtitles, labels, or other text information appearing in the video;
3.2 extracting single-frame multimodal original information
While extracting the text content, the system collects the multimodal information and the original image and speech information for each frame; together these form a comprehensive description of each frame;
Step four:
4.1 analyzing the semantics of the whole video
4.2 extracting the multimodal semantic information
Step five, model scoring
The text semantic vector is used to guide the scoring of the video semantic vector, and the text information guides the prediction from the video semantic features through a Cross Attention structure;
Step six, outputting the highlight segments
Finally, the highlight segments are output according to the service characteristics.
After the video is input, it is framed so that the multimodal information in the images and the original image and speech information can be extracted. The semantic content of the whole video is then analyzed to understand the theme of the video and the like. A Cross Attention structure is used so that the text semantic vector guides the prediction from the video semantic vector; in this way the highlight segments in the video can be accurately predicted, and the prediction accuracy for the highlight segments is improved.
In step one, when the video is input, it must be confirmed that the received video file format is supported; when the video format cannot be identified, the video format must be converted.
By converting the video format at input time, the video can be identified and the subsequent processing can proceed stably.
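A minimal sketch of this input check, assuming ffmpeg/ffprobe are installed, is shown below: the container format is probed and, if it is not in a supported list, the file is transcoded to MP4. The supported-format set and paths are illustrative.

```python
import subprocess

SUPPORTED_FORMATS = {"mov,mp4,m4a,3gp,3g2,mj2", "matroska,webm"}

def ensure_supported(path: str) -> str:
    # Probe the container format name with ffprobe.
    fmt = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=format_name",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if fmt in SUPPORTED_FORMATS:
        return path
    # Unsupported or unidentified format: transcode to MP4.
    converted = path.rsplit(".", 1)[0] + "_converted.mp4"
    subprocess.run(["ffmpeg", "-y", "-i", path, converted], check=True)
    return converted
```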
In step two, a proper framing mode must be selected, so that key information is not lost through frame segmentation, and the influence of the frame rate on subsequent processing must be considered; for example, an excessively high frame rate increases the processing difficulty and the amount of computation.
By decomposing the video into a number of individual frames, the continuous video content can be converted into a discrete sequence of images, providing a basis for subsequent processing.
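A minimal sketch of this framing step with OpenCV follows; frames are sampled at a target rate so that an excessively high native frame rate does not inflate the computation. The 5 fps target is an illustrative choice.

```python
import cv2

def extract_frames(path: str, target_fps: float = 5.0):
    """Decompose a video into a discrete sequence of still images."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(native_fps / target_fps))  # keep every step-th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```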
In step five, the video-frame input to the Cross Attention structure generally takes the features of 7 frames, namely the current frame together with the 3 frames before and the 3 frames after it, as input; the video semantic features serve as the main information for prediction, and the text semantic features serve as auxiliary features.
The more similar the content represented by the video features is to the text features, the more likely the frame is to be predicted as a highlight frame.
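A minimal sketch of this scoring head follows, assuming PyTorch: the 7-frame window of video features is the attention query (the main information), the text semantic vector acts as key and value (the auxiliary, guiding information), and the attended feature of the middle frame is mapped to a highlight score. The dimensions and the linear scoring head are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model = 512  # illustrative feature dimension
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8,
                                   batch_first=True)
score_head = nn.Linear(d_model, 1)  # assumed scoring head

# Current frame plus 3 frames before and 3 after: a 7-frame window.
video_feats = torch.rand(1, 7, d_model)  # query: main prediction information
text_vec = torch.rand(1, 1, d_model)     # key/value: auxiliary text semantics

attended, _ = cross_attn(query=video_feats, key=text_vec, value=text_vec)
center = attended[:, 3, :]               # the current (middle) frame
highlight_score = torch.sigmoid(score_head(center))
```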
In step six, before the final highlight segments are generated, adjustment or optimization must be performed according to specific service characteristics or requirements; for example, some services may pay more attention to the originality of the video, while others may pay more attention to the efficiency of information transmission.
Highlight segments with different emphases are generated according to the characteristics of the service, so that the highlight segments better meet the service requirements.
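As a minimal sketch of how per-frame scores might be turned into output segments under a service-specific constraint, the function below thresholds the score sequence and clips each segment to a maximum length; the threshold, length limit, and clipping rule are all illustrative assumptions, since the source leaves the service-specific logic open.

```python
import numpy as np

def select_segments(scores: np.ndarray, fps: float,
                    threshold: float = 0.7, max_seconds: float = 15.0):
    """Group above-threshold frames into (start_s, end_s) highlight segments."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(scores) / fps))
    # A service favoring brevity might clip each segment to max_seconds.
    return [(b, min(e, b + max_seconds)) for b, e in segments]
```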
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.