
CN119418241A - System and method for extracting highlight video based on automobile field - Google Patents

System and method for extracting highlight video based on automobile field

Info

Publication number
CN119418241A
CN119418241A (application CN202411441413.3A)
Authority
CN
China
Prior art keywords
video
semantic
information
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411441413.3A
Other languages
Chinese (zh)
Inventor
郭鹏
李本阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chezhi Interconnection Beijing Technology Co ltd
Original Assignee
Chezhi Interconnection Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chezhi Interconnection Beijing Technology Co ltd filed Critical Chezhi Interconnection Beijing Technology Co ltd
Priority to CN202411441413.3A priority Critical patent/CN119418241A/en
Publication of CN119418241A publication Critical patent/CN119418241A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - using classification, e.g. of video objects
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/771 - Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion of extracted features
    • G06V 10/809 - Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811 - the classifiers operating on different input data, e.g. multi-modal recognition
    • G06V 10/82 - using neural networks
    • G06V 20/00 - Scenes; scene-specific elements
    • G06V 20/40 - Scenes; scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/47 - Detecting features for summarising video content
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 - Detection; Localisation; Normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention belongs to the field of computer technology and discloses a system for extracting highlight videos in the automotive field, comprising a text semantic analysis module that obtains the semantic information of the entire video. After a video is input, it is split into frames to extract the multimodal information in the images together with the raw image and audio data. The title and body text associated with the video, along with its speech and subtitles, are converted into a text semantic vector. A general classification model, a general quality scoring model, a general face detection and recognition model, and a fine-grained detection model for the automotive field then form an image semantic vector for the entire video that reflects the characteristics of the automotive domain. Finally, the audio data in the video is acquired, its feature vector is extracted and merged with the image vector to obtain the video's semantic vector, and a Cross Attention structure lets the text semantic vector guide the video semantic features for prediction.

Description

System and method for extracting highlight videos in the automotive field
Technical Field
The invention belongs to the technical field of computers, and in particular relates to a system and a method for extracting highlight videos in the automotive field.
Background
A highlight video consists of the most engaging segments of a video. Compared with medium-length or short videos, a long video demands more time and patience from viewers, who must screen its content and work gradually into the storyline. To optimize the viewing experience and make long videos easier to find and filter, platforms have introduced features that automatically play a long video's highlight moments or directly display its highlight segments. The goal is to let users grasp the core appeal of a long video in a short time and form an initial impression of, and interest in, its content without watching it in full.
Existing highlight-detection algorithms mainly target films, variety shows, sports events, games, and similar material, and extract highlight segments from such videos. These videos typically lack a single, clearly defined topic and feature large scene changes, so the range of extractable content is broad. Technically, existing highlight-extraction algorithms first split the video into frames, feed the images, speech, and text into corresponding general CNN models to obtain feature vectors, and then pass those vectors into a task-specific fully connected network. Although this approach can detect highlight segments, it performs poorly on single-scene videos or videos whose scene content changes little, it cannot refine the video's subject information, and using a CNN classification model to obtain the video's semantic features introduces interference from background information that degrades the recognition results. Improvement is therefore needed.
Disclosure of Invention
The invention aims to provide a system and a method for extracting highlight videos in the automotive field, so as to solve the problems raised in the background art.
In order to achieve the above purpose, the invention provides the following technical scheme. The system for extracting highlight videos in the automotive field comprises:
the text semantic analysis module is used for acquiring semantic information of the whole video;
The text semantic analysis module comprises:
The general classification model is used for acquiring basic image features such as edges and geometry;
The general quality scoring model is used for evaluating the quality of the video;
The general face detection and recognition model is used for detecting and recognizing the faces appearing in the video;
The fine-grained detection model for the automotive field is used for identifying the shooting angle of the automobile in an image and for extracting the intermediate layer of the neural network corresponding to the detection box as a semantic vector.
Preferably, the text semantic analysis module obtains the video semantic information from three sources, namely the title and body text associated with the video, the speech in the video, and the subtitles in the video; the three kinds of text are combined, the semantics the whole video is meant to express are obtained through a trained language model, and the output is a vector of a specified length.
Preferably, the outputs of the general classification model, the general quality scoring model, the general face detection and recognition model, and the automotive-field fine-grained detection model are combined to form an image semantic vector for the whole video; the audio data in the video is then acquired, its feature vector is extracted, and it is merged with the image vector to obtain the semantic vector of the video.
Preferably, the face detection and recognition model must emphasize accuracy and robustness to ensure that it can accurately detect and recognize faces in different environments.
The method for extracting highlight videos in the automotive field comprises the following steps:
Step one, video input
Firstly, receiving a video file as input;
Step two, video framing
Then, the input video is split into frames, i.e. decomposed into a series of still images;
Step three:
3.1 text content extraction
For each decomposed frame, the system extracts the text content in the frame, including subtitles, labels, or other text information appearing in the video;
3.2 extracting single frame multimode original information
While extracting the text content, the system also collects, for each frame, the multimodal information and the raw data of the corresponding image and audio; together these form a comprehensive description of each frame;
Step four:
4.1 Analyzing the semantics of the whole video
4.2 Extracting the multimodal semantic information
Step five, model scoring
The text semantic vector guides the scoring of the video semantic vector: through a Cross Attention structure, the text information guides the prediction over the video semantic features;
Step six, outputting the highlight segments
Finally, outputting the highlight segments according to the service characteristics.
Preferably, in step one, when the video is input, it must be confirmed that the received video file format is supported; when the format cannot be recognized, the video format must be converted.
Preferably, the video framing in step two requires selecting a suitable framing scheme to ensure that key information is not lost through frame segmentation, and the influence of the frame rate on subsequent processing must be considered; for example, an excessively high frame rate increases the processing difficulty and the amount of computation.
Preferably, the video-frame input to the Cross Attention structure in step five generally takes as input the features of seven frames, namely the current frame and the three frames before and after it, wherein the video semantic features serve as the main information for prediction and the text semantic features serve as auxiliary features.
Preferably, before the final highlight segment is generated in step six, adjustment or optimization is performed according to specific service characteristics or requirements; for example, some services may care more about the originality of the video, while others may care more about the efficiency of information delivery.
The beneficial effects of the invention are as follows:
After a video is input, it is split into frames to extract the multimodal information in the images together with the raw image and audio data. The title and body text associated with the video, along with its speech and subtitles, are converted into a text semantic vector. Through a general classification model, a general quality scoring model, a general face detection and recognition model, and a fine-grained detection model for the automotive field, an image semantic vector for the whole video is formed based on the characteristics of the automotive domain. Finally, the audio data in the video is acquired, its feature vector is extracted and merged with the image vector to obtain the video's semantic vector, and a Cross Attention structure lets the text semantic vector guide the video semantic features for prediction. Highlight segments in the video can thus be predicted accurately, improving the accuracy of highlight-segment prediction. At the same time, highlight segments are extracted and output according to service characteristics and user preferences, enabling personalized recommendation and improving recommendation accuracy and user satisfaction.
Drawings
FIG. 1 is a schematic diagram of the overall process of the present invention;
FIG. 2 is a schematic diagram of an image processing flow according to the present invention;
FIG. 3 is a schematic diagram of the Cross Attention structure of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1 to 3, an embodiment of the present invention provides an extraction system for a highlight video based on an automotive field, including:
the text semantic analysis module is used for acquiring semantic information of the whole video;
The text semantic analysis module comprises:
The general classification model is used for acquiring basic image features such as edges and geometry;
The general quality scoring model is used for evaluating the quality of the video;
The general face detection and recognition model is used for detecting and recognizing the faces appearing in the video;
The fine-grained detection model for the automotive field is used for identifying the shooting angle of the automobile in an image and for extracting the intermediate layer of the neural network corresponding to the detection box as a semantic vector.
The general classification model performs a preliminary classification or identification of the video content, recognizing basic elements such as objects and scenes and providing a foundation for finer subsequent processing. The quality scoring model evaluates the quality of video frames. The face detection and recognition model detects and recognizes faces in the video in real time, adding an extra dimension to video analysis and processing. Finally, the automotive-field fine-grained detection model extracts the automobile's detection box from a video frame and takes the intermediate layer of the neural network corresponding to that box as its semantic vector. This semantic vector carries rich information about the automobile, such as its model, color, and pose, and provides an important basis for subsequent analysis and processing.
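As an illustrative sketch of the intermediate-layer extraction described above, a forward hook can tap a backbone's feature map and pool it into a per-frame semantic vector. The patent does not name its backbone, detector, or tapped layer, so a torchvision ResNet-50 and its last residual stage stand in here, and the frame path is hypothetical:

```python
# Sketch: pull an intermediate-layer activation as a semantic vector.
# ResNet-50 is a stand-in; the patent does not disclose its models.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

features = {}

def hook(module, inputs, output):
    # Global-average-pool the feature map into a single vector.
    features["vec"] = output.mean(dim=(2, 3)).squeeze(0)

# Tap the last residual stage as the "intermediate layer".
model.layer4.register_forward_hook(hook)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame_0001.jpg").convert("RGB")  # hypothetical path
with torch.no_grad():
    model(preprocess(frame).unsqueeze(0))

semantic_vec = features["vec"]  # 2048-dim vector for this frame
```

In the patented system such a hook would sit on the automotive fine-grained detector and be applied to the cropped detection box rather than the full frame.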
The text semantic analysis module obtains the video semantic information from three sources, namely the title and body text associated with the video, the speech in the video, and the subtitles in the video; the three kinds of text are combined, the semantics the whole video is meant to express are obtained through a trained language model, and the output is a vector of a specified length.
Speech in the video can be converted to text by automatic speech recognition (ASR), while subtitles in the video can be converted to text by optical character recognition (OCR).
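A minimal sketch of this text pipeline, assuming Whisper for ASR, Tesseract for OCR, and a Sentence-Transformers model as the trained language model (none of these tools are named in the patent, and the file paths and metadata are hypothetical):

```python
# Sketch: gather title/body text, ASR transcript, and OCR'd subtitles,
# then embed the combined text as one fixed-length semantic vector.
import whisper
import pytesseract
from PIL import Image
from sentence_transformers import SentenceTransformer

title_and_body = "2024 test drive: full highway review"  # hypothetical metadata

# Speech -> text via ASR (Whisper is an assumed stand-in).
asr_model = whisper.load_model("base")
speech_text = asr_model.transcribe("video_audio.wav")["text"]

# Subtitles -> text via OCR (Tesseract is an assumed stand-in).
subtitle_text = pytesseract.image_to_string(Image.open("subtitle_frame.jpg"))

combined = " ".join([title_and_body, speech_text, subtitle_text])

# Trained language model -> vector of a specified length (384 dims here).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
text_semantic_vec = embedder.encode(combined)
```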
The outputs of the general classification model, the general quality scoring model, the general face detection and recognition model, and the automotive-field fine-grained detection model are combined to form an image semantic vector for the whole video; the audio data in the video is then acquired, its feature vector is extracted, and it is merged with the image vector to obtain the semantic vector of the video.
The semantic vector of the whole video is obtained by combining the video's image semantic vector with the feature vector of its audio data. This video semantic vector not only integrates the overall content of the video but also guides the scoring process, ultimately optimizing the output result.
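A sketch of this fusion step, assuming clip-level MFCC statistics from librosa as the audio feature vector; the patent does not specify its audio features, and the image vector below is a placeholder for the pooled image semantic vector computed earlier:

```python
# Sketch: extract an audio feature vector and concatenate it with the
# image semantic vector to form the video semantic vector.
import numpy as np
import librosa

waveform, sr = librosa.load("video_audio.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=40)
audio_vec = mfcc.mean(axis=1)  # 40-dim clip-level summary

image_vec = np.random.randn(2048)  # placeholder for the pooled image semantic vector
video_semantic_vec = np.concatenate([image_vec, audio_vec])  # 2088 dims
```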
The face detection and recognition model must emphasize accuracy and robustness to ensure that it can accurately detect and recognize faces in different environments.
In practical applications, various difficult face conditions may be encountered, such as occlusion, side-facing faces, and illumination changes, so the accuracy and robustness of the face detection and recognition model are critical.
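As a baseline sketch only, face detection on a frame might use OpenCV's Haar cascade; that detector is notably weak under exactly the occlusion and lighting conditions mentioned above, which is why a more robust model would be needed in practice:

```python
# Sketch: a baseline face detector. The Haar cascade is only a
# stand-in; the patent's face model is not named.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame_0001.jpg")  # hypothetical path
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    print(f"face at x={x}, y={y}, w={w}, h={h}")
```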
The method for extracting highlight videos in the automotive field comprises the following steps:
Step one, video input
Firstly, receiving a video file as input;
Step two, video framing
Then, the input video is split into frames, i.e. decomposed into a series of still images;
Step three:
3.1 text content extraction
For each decomposed frame, the system extracts the text content in the frame, including subtitles, labels, or other text information appearing in the video;
3.2 extracting single frame multimode original information
While extracting the text content, the system also collects, for each frame, the multimodal information and the raw data of the corresponding image and audio; together these form a comprehensive description of each frame;
Step four:
4.1 Analyzing the semantics of the whole video
4.2 Extracting the multimodal semantic information
Step five, model scoring
The text semantic vector guides the scoring of the video semantic vector: through a Cross Attention structure, the text information guides the prediction over the video semantic features;
Step six, outputting the highlight segments
Finally, outputting the highlight segments according to the service characteristics.
After the video is input, it is split into frames so that the multimodal information in the images and the raw data of the corresponding images and audio can be extracted; the semantic content of the whole video is then analyzed to understand its theme and the like. Using a Cross Attention structure, the text semantic vector guides the prediction over the video semantic vector, so highlight segments in the video can be predicted accurately, improving the prediction accuracy for highlight segments.
In step one, when the video is input, it must be confirmed that the received video file format is supported; when the format cannot be recognized, the video format must be converted.
Converting the video format at input time ensures that the video can be recognized and that subsequent processing proceeds reliably.
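A sketch of such a format check, assuming OpenCV for probing the container and the ffmpeg-python package for transcoding; the patent does not say which tools perform the conversion:

```python
# Sketch: verify the container can be opened and transcode to MP4
# when it cannot. Tooling choices are assumptions.
import cv2
import ffmpeg  # ffmpeg-python package

def ensure_readable(path: str) -> str:
    cap = cv2.VideoCapture(path)
    ok = cap.isOpened()
    cap.release()
    if ok:
        return path
    converted = path.rsplit(".", 1)[0] + "_converted.mp4"
    ffmpeg.input(path).output(converted, vcodec="libx264", acodec="aac").run()
    return converted
```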
In step two, a suitable framing scheme must be selected to ensure that key information is not lost through frame segmentation, and the influence of the frame rate on subsequent processing must be considered; for example, an excessively high frame rate increases the processing difficulty and the amount of computation.
Decomposing the video into individual frames converts the continuous video content into a discrete sequence of images, providing the basis for subsequent processing.
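A sketch of framing with a bounded sampling rate, reflecting the caution above about excessively high frame rates; the one-frame-per-second default is an assumption, not a value from the patent:

```python
# Sketch: decompose a video into frames at a fixed sampling rate
# to keep the processing load bounded.
import cv2

def sample_frames(path: str, fps_out: float = 1.0):
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # guard a 0.0 readout
    step = max(1, round(native_fps / fps_out))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```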
In step five, the video-frame input to the Cross Attention structure generally takes as input the features of seven frames, namely the current frame and the three frames before and after it, wherein the video semantic features serve as the main information for prediction and the text semantic features serve as auxiliary features.
The more similar a frame's video features are to the text features, the greater the likelihood that it is predicted to be a highlight frame.
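A sketch of this text-guided scoring with PyTorch's MultiheadAttention. One plausible reading assigns the seven video-frame features to the queries and the text semantic vector to the keys and values; the dimensions, projections, and scoring head are all assumptions, since the patent fixes only the seven-frame window and the main/auxiliary roles:

```python
# Sketch: text-guided highlight scoring via cross-attention over a
# 7-frame window (current frame plus 3 neighbours on each side).
import torch
import torch.nn as nn

class HighlightScorer(nn.Module):
    def __init__(self, vid_dim=2088, txt_dim=384, d_model=256):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, vid_window, text_vec):
        # vid_window: (B, 7, vid_dim); text_vec: (B, txt_dim)
        q = self.vid_proj(vid_window)               # video features as queries
        kv = self.txt_proj(text_vec).unsqueeze(1)   # text guides as key/value
        attended, _ = self.cross_attn(q, kv, kv)
        return self.head(attended[:, 3])  # score for the centre frame

scorer = HighlightScorer()
score = scorer(torch.randn(1, 7, 2088), torch.randn(1, 384))
```

A sigmoid over the score would give the per-frame highlight probability used in the output step.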
In step six, before the final highlight segment is generated, adjustment or optimization is performed according to specific service characteristics or requirements; for example, some services may care more about the originality of the video, while others may care more about the efficiency of information delivery.
Generating highlight segments with different emphases according to the characteristics of the service makes the segments better match the service requirements.
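A sketch of one such service-tunable output stage: per-frame scores are thresholded and merged into contiguous segments, with the threshold and minimum segment length left as service-specific knobs (both values below are assumptions):

```python
# Sketch: turn per-frame highlight scores into output segments by
# thresholding and merging consecutive high-scoring frames.
def scores_to_segments(scores, fps=1.0, threshold=0.7, min_len=3.0):
    segments, start = [], None
    for i, s in enumerate(scores + [0.0]):  # sentinel flushes the last run
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            t0, t1 = start / fps, i / fps
            if t1 - t0 >= min_len:
                segments.append((t0, t1))
            start = None
    return segments

print(scores_to_segments([0.2, 0.8, 0.9, 0.85, 0.9, 0.3, 0.95]))
# -> [(1.0, 5.0)]: the trailing single high frame is too short to keep
```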
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A system for extracting highlight videos in the automotive field, characterized in that it comprises:
the text semantic analysis module is used for acquiring semantic information of the whole video;
The text semantic analysis module comprises:
The general classification model is used for acquiring basic image features such as edges and geometry;
The general quality scoring model is used for evaluating the quality of the video;
The general face detection and recognition model is used for detecting and recognizing the faces appearing in the video;
The fine-grained detection model for the automotive field is used for identifying the shooting angle of the automobile in an image and for extracting the intermediate layer of the neural network corresponding to the detection box as a semantic vector.
2. The system for extracting highlight videos in the automotive field of claim 1, wherein the text semantic analysis module obtains the video semantic information from three sources, namely the title and body text associated with the video, the speech in the video, and the subtitles in the video; the three kinds of text are combined, the semantics the whole video is meant to express are obtained through a trained language model, and the output is a vector of a specified length.
3. The system for extracting highlight videos in the automotive field of claim 1, wherein the outputs of the general classification model, the general quality scoring model, the general face detection and recognition model, and the automotive-field fine-grained detection model are combined to form an image semantic vector for the whole video; the audio data in the video is acquired, its feature vector is extracted, and it is merged with the image vector to obtain the semantic vector of the video.
4. The system for extracting highlight videos in the automotive field of claim 1, wherein the face detection and recognition model must emphasize accuracy and robustness to ensure that it can accurately detect and recognize faces in different environments.
5. A method for extracting highlight videos in the automotive field, characterized by comprising the following steps:
Step one, video input
Firstly, receiving a video file as input;
Step two, video framing
Then, the input video is split into frames, i.e. decomposed into a series of still images;
Step three:
3.1 text content extraction
For each decomposed frame, the system extracts the text content in the frame, including subtitles, labels, or other text information appearing in the video;
3.2 extracting single frame multimode original information
While extracting the text content, the system also collects, for each frame, the multimodal information and the raw data of the corresponding image and audio; together these form a comprehensive description of each frame;
Step four:
4.1 Analyzing the semantics of the whole video
4.2 Extracting the multimodal semantic information
Step five, model scoring
The text semantic vector guides the scoring of the video semantic vector: through a Cross Attention structure, the text information guides the prediction over the video semantic features;
Step six, outputting the highlight segments
Finally, outputting the highlight segments according to the service characteristics.
6. The method for extracting highlight videos in the automotive field of claim 5, wherein in step one, when the video is input, it is confirmed that the received video file format is supported, and when the format cannot be recognized, the video format is converted.
7. The method for extracting highlight videos in the automotive field of claim 5, wherein the video framing in step two requires selecting a suitable framing scheme to ensure that key information is not lost through frame segmentation, and the influence of the frame rate on subsequent processing is considered, for example that an excessively high frame rate increases the processing difficulty and the amount of computation.
8. The method of claim 5, wherein the video-frame input to the Cross Attention structure in step five generally takes as input the features of seven frames, namely the current frame and the three frames before and after it, wherein the video semantic features serve as the main information for prediction and the text semantic features serve as auxiliary features.
9. The method of claim 5, wherein in step six, before the final highlight segment is generated, adjustment or optimization is performed according to specific service characteristics or requirements, for example some services may care more about the originality of the video while others may care more about the efficiency of information delivery.
CN202411441413.3A 2024-10-15 2024-10-15 System and method for extracting highlight video based on automobile field Pending CN119418241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411441413.3A CN119418241A (en) 2024-10-15 2024-10-15 System and method for extracting highlight video based on automobile field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411441413.3A CN119418241A (en) 2024-10-15 2024-10-15 System and method for extracting highlight video based on automobile field

Publications (1)

Publication Number Publication Date
CN119418241A true CN119418241A (en) 2025-02-11

Family

ID=94458990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411441413.3A Pending CN119418241A (en) 2024-10-15 2024-10-15 System and method for extracting highlight video based on automobile field

Country Status (1)

Country Link
CN (1) CN119418241A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination