CN112256913A - Video searching method based on graph model comparison - Google Patents
Video searching method based on graph model comparison
- Publication number
- CN112256913A
- Authority
- CN
- China
- Prior art keywords
- video
- text
- graph
- similarity
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
 
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Library & Information Science (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a video searching method based on graph model comparison. After voice input has been converted into text, the multimedia resources and the text to be searched are converted into graph models, so that the details of the text and of the multimedia resources can be analyzed, and the multimedia video search decision is made by comparing the similarity of the graph models. By combining multiple resources, in particular pilot resources, videos can be scored more promptly; the combination of multiple resources is matched by the design of the network structure, which gives the model good robustness, and the video recommendation system is further optimized by using the advanced resources.
    Description
Technical Field
      The invention relates to the technical field of multimedia resource retrieval, in particular to a video searching method based on graph model comparison.
    Background
      Today the market continues to grow, and to keep up with consumer demand many companies hope to provide intelligent video and music search services. However, as the volume of online video grows geometrically, quickly finding a video that meets the user's needs among the huge number of videos on the internet has become a difficult problem. At present, the main approaches to video search are indirect search through a search engine and search within a video client. These approaches rely mainly on keyword search over attributes such as the title, actors and director, or on search by film and television category, and the unit displayed is the whole video, for example a TV series played episode by episode or a film played at its full length. The accuracy of the prior art is not ideal; in particular, text-based multimedia resource search after voice input has been converted to text leaves room for further optimization.
      Existing intelligent terminals use one of two methods:
      1) Fuzzy search matching is performed directly, without using artificial intelligence. This method cannot search or recommend based on the features of the multimedia resources themselves (sound and pictures), which degrades the experience of users who need such searches.
      2) Text and picture features are simply extracted to obtain feature maps, and matching is then performed on the feature maps. However, the effect of this method is difficult to improve further, and its accuracy leaves considerable room for improvement.
      Each of the two methods works well in certain scenarios. For example, when a user searches by voice for the TV series "Langya list", fuzzy matching search requires the user to know the name of the series and to say "I want to see Langya list" or type it in a search box; the device receives the text, analyzes its semantics and finds the corresponding series. For feature search, the user may remember only a certain feature or an earlier screenshot, such as a TV show in which a particular song was played; when the user says "I want to see the first TV show in which that song was played", the device cannot perform the search because it does not understand the request.
      When the user says by voice: "I want to see the TV series in which a man in white descends from the sky and then stuns the whole audience", both methods fall short for a series containing such content: the first method cannot match the series, and the second method cannot match the action.
      Disclosure of Invention
      The present invention aims to provide a video searching method based on graph model comparison. After voice input has been converted to text, the multimedia resources and the text to be searched are converted into graph models so that the details of the text and of the multimedia resources can be analyzed, and the multimedia video search decision is made by comparing the similarity of the graph models. Specifically, the input voice or text is first preprocessed, and ill-formed sentences are discarded without further processing. The remaining work is to compare the similarity between the graph built from the text and the graphs of the video resources to be searched in the database. The graph construction methods for text and video are described in detail below and are assisted by neural networks: an existing tool performs syntactic analysis of the text, and Faster-RCNN performs target detection on the images, from which the corresponding graphs are constructed.
      The invention achieves this purpose through the following technical scheme:
      A video searching method based on graph model comparison comprises the following steps:
      step 1, composing a graph from the text obtained from the voice input:
      inputting the sentences into the Stanford Parser, directly deleting sentences whose output sentence components do not satisfy the subject + predicate structure, and composing a text information graph from the result;
      step 2, composing a graph from the video:
      carrying out target detection on each image frame, converting the image content into subject + predicate form, and then composing a video information graph from the classification result;
      step 3, comparing the similarity of the text information graph and the video information graph:
      extracting features of the words with a convolutional layer, comparing them with the features of each category, and selecting the category corresponding to the most similar feature map as the classification result;
      applying the similarity comparison between the text graph and each graph in the database, and finally selecting the graph with the highest similarity as the search result.
      Further, the text information composition in step 1 includes training the Stanford Parser on a related data set.
      Further, the video information composition in step 2 includes training Faster-RCNN on a related data set.
      Further, in step 3, the similarity comparison specifically includes:
      1) for the similarity comparison, all video resources are stored in a database to be searched, and the text is used as a single search input; nodes with the same subject and corresponding predicate are identified with each other;
      2) of the two graphs, the graph formed from the text is preprocessed; for each node, the subject is classified into the same subject categories that are preset for the video resources, and the actions are processed in the same way, so that the two graphs are expressed in the same dimensions; specifically, a convolutional layer extracts features of the words, these features are compared with the features of each category, and the category corresponding to the most similar feature map is selected as the classification result;
      3) in the preprocessed graphs, nodes that contain exactly the same information are regarded as the same node, the length of every edge between nodes is set to 1, and the distance between nodes separated by several edges is the accumulated edge length;
      4) the whole similarity comparison problem can then be solved with an approximation algorithm for the quadratic programming problem; the comparison is applied between the text graph and each graph in the database, and the graph with the highest similarity is finally selected as the search result.
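The category assignment in item 2) can be illustrated with a nearest-prototype lookup. This is only a minimal sketch under stated assumptions: the convolutional feature extractor is not shown, each word and each category is assumed to be already represented by a fixed-length feature vector, and the category names and vectors below are hypothetical. The Euclidean norm of the feature difference is used as the dissimilarity measure, which is one possible reading of the comparison described here.

```python
import math


def feature_distance(a, b):
    """Euclidean norm of the element-wise difference of two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(word_feature, category_features):
    """Return the category whose prototype feature is closest to word_feature."""
    return min(
        category_features,
        key=lambda name: feature_distance(word_feature, category_features[name]),
    )


# Hypothetical category prototypes; in practice these would be the preset
# subject (or action) categories shared with the video graphs.
CATEGORY_FEATURES = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "child": [0.0, 0.0, 1.0],
}

# A hypothetical feature vector for a word taken from the query text.
label = classify([0.9, 0.2, 0.1], CATEGORY_FEATURES)
print(label)
```

Mapping both subjects and actions onto the preset categories in this way is what lets the text graph and the video graphs share the same node vocabulary.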
      The beneficial effects of the invention are as follows:
      By combining multiple resources, in particular pilot resources, videos can be scored more promptly; the combination of multiple resources is matched by the design of the network structure, which gives the model good robustness, and the video recommendation system is further optimized by using the advanced resources.
    Drawings
      In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described here show only some embodiments, and a person skilled in the art can obtain other drawings from them without creative effort.
      FIG. 1 is an example of a composition model of text information according to the present invention.
      FIG. 2 is an example of a composition model of video information according to the present invention.
    Detailed Description
      In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below. The described embodiments are merely examples of the invention and do not limit its full scope. All other embodiments that a person skilled in the art can derive from the examples given here without inventive effort fall within the scope of the present invention.
      In one embodiment, as shown in FIGS. 1-2, the video search method based on graph model comparison according to the present invention includes:
      1 composition of text information:
      Most sentences in human natural language have the structure "subject + predicate", which can be understood simply as someone doing something. The present method builds on this observation: the video resources and the text to be searched are converted into graph model structures in time order, and video search is realized by comparing the similarity of the graph models.
      For the text, the Stanford Parser is used to extract the sentence components (the tool can be regarded as well known). For example, in the sentence "a man in white descends from the sky and then stuns the whole audience", "a man in white" is the subject, "descends from the sky" is an action, and "stuns the whole audience" is also an action. Subjects are labeled as class I nodes and actions as class II nodes. For each action performed by a class I node, an edge is drawn between that class I node and the corresponding class II node. If a sentence contains several subjects, then after the actions of each subject have been analyzed, a new node is added as the parent of the class I nodes and serves as the root, so that every graph is a tree with exactly one root.
      Each sentence is composed in this way, so the text is converted into a number of trees, each with exactly one root. For actions that follow one another in time in the original text, an edge is added between the corresponding root nodes. For example, if the sentence following the one above is "the duke on the seat jumps up and responds", an edge is drawn between the node for the duke on the seat and the node for the man in white, which completes the text-based composition.
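The text composition described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: it assumes the parser output has already been reduced to a list of `(subject, [actions])` clauses per sentence, and the function names, node identifiers and example strings are hypothetical.

```python
def build_sentence_tree(sentence_id, clauses):
    """Build one tree from a sentence given as [(subject, [actions]), ...]."""
    if not clauses:
        raise ValueError("a sentence needs at least one subject")
    nodes, edges, subject_ids = {}, [], []
    for i, (subject, actions) in enumerate(clauses):
        sid = f"s{sentence_id}.subj{i}"
        nodes[sid] = ("I", subject)          # class I node: subject
        subject_ids.append(sid)
        for j, action in enumerate(actions):
            aid = f"{sid}.act{j}"
            nodes[aid] = ("II", action)      # class II node: action
            edges.append((sid, aid))
    if len(subject_ids) == 1:
        root = subject_ids[0]
    else:
        # Several subjects: add one parent node so the tree has a single root.
        root = f"s{sentence_id}.root"
        nodes[root] = ("root", None)
        edges.extend((root, sid) for sid in subject_ids)
    return root, nodes, edges


def build_text_graph(sentences):
    """Compose all sentences and link the roots of consecutive sentences."""
    all_nodes, all_edges, roots = {}, [], []
    for k, clauses in enumerate(sentences):
        root, nodes, edges = build_sentence_tree(k, clauses)
        all_nodes.update(nodes)
        all_edges.extend(edges)
        roots.append(root)
    all_edges.extend(zip(roots, roots[1:]))  # temporal order between sentences
    return all_nodes, all_edges, roots


# Hypothetical clauses loosely based on the example in the text.
sentences = [
    [("man in white", ["descends", "amazes the audience"])],
    [("duke on the seat", ["responds"])],
]
nodes, edges, roots = build_text_graph(sentences)
print(len(nodes), len(edges), roots)
```

Keeping subjects and actions as separately typed nodes makes it straightforward to apply the same construction to detections from video frames.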
      2 composition of video information:
      The composition of the video information is assisted by the target detection network Faster-RCNN, which identifies the people and actions in each frame. For example, if a man in an image is kowtowing, Faster-RCNN extracts the information "man" and "kowtow". The specific categories of people and actions are preset in advance. Frames in which the action and the person are the same are not composed again. The subject nodes and action nodes are connected in the same way as in the text composition, and when a frame contains several subject nodes, one additional node is used as their root. Nodes that are adjacent in time are connected in sequence. This completes the graph construction for the video information.
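A possible shape for the video composition is sketched below. It assumes the detector output is already available as a list of `(subject, action)` pairs per frame; no detector is called, and the function names and example detections are hypothetical. The sketch also assumes that "not composed again" applies to consecutive frames with identical detections, which the source does not state explicitly.

```python
def collapse_repeated_frames(frames):
    """Skip empty frames and frames whose detections repeat the previous kept frame."""
    kept = []
    for detections in frames:
        if not detections:
            continue
        key = frozenset(detections)
        if not kept or key != kept[-1]:
            kept.append(key)
    return kept


def build_video_graph(frames):
    """Build subject/action trees per distinct frame and link them in time order."""
    nodes, edges, roots = {}, [], []
    for t, detections in enumerate(collapse_repeated_frames(frames)):
        by_subject = {}
        for subject, action in sorted(detections):
            by_subject.setdefault(subject, []).append(action)
        subject_ids = []
        for i, (subject, actions) in enumerate(by_subject.items()):
            sid = f"f{t}.subj{i}"
            nodes[sid] = ("I", subject)       # subject node
            subject_ids.append(sid)
            for j, action in enumerate(actions):
                aid = f"{sid}.act{j}"
                nodes[aid] = ("II", action)   # action node
                edges.append((sid, aid))
        if len(subject_ids) == 1:
            root = subject_ids[0]
        else:
            # Several subjects in one frame share an added root node.
            root = f"f{t}.root"
            nodes[root] = ("root", None)
            edges.extend((root, sid) for sid in subject_ids)
        roots.append(root)
    edges.extend(zip(roots, roots[1:]))       # connect frames adjacent in time
    return nodes, edges, roots


# Hypothetical per-frame detections.
frames = [
    [("man", "kowtow")],
    [("man", "kowtow")],                      # same as previous frame, skipped
    [("man", "stand"), ("woman", "walk")],
]
nodes, edges, roots = build_video_graph(frames)
print(len(nodes), len(edges), roots)
```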
      3 similarity comparison
      The graph constructed from the text is compared for similarity with the graphs constructed from the videos, and the graph with the highest similarity is taken as the video search result.
      In one embodiment, as shown in FIGS. 1-2, the video search method based on graph model comparison according to the present invention includes:
      Both the Stanford Parser and Faster-RCNN are trained using MovieGraph, a data set that includes video resources and commentary for each resource. The data set is large and credible. The Stanford Parser is trained on the text, with the corresponding sentence components labeled manually; the training data for Faster-RCNN is likewise labeled manually. First, the data set terminology is defined:
      Training set: the sample set used for training, i.e. for fitting the parameters of the corresponding neural network.
      Validation set: the data set used to validate the network model. After the network has been trained on the training set, the performance of the network model is compared and judged on this data set.
      Test set: the data set used to test and evaluate the performance of the trained neural network.
      MovieLens dataset: a data set relating to movie ratings.
      1, data acquisition:
      The complete public data set is used as the data sample. The samples are randomly split according to the proportion of 8: 1 into a training set, a validation set and a test set.
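A random split into the three subsets might look like the following sketch. The ratio `(8, 1, 1)` is an assumption: the text gives only "8: 1" for three subsets, so the third value is not confirmed by the source. The function name, seed and sample identifiers are hypothetical.

```python
import random


def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle samples and split them into training, validation and test sets.

    The default ratios are an assumed 8:1:1 split, not a value taken from the source.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)        # reproducible shuffle
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]            # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```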
      2 text information composition:
      1) Stanford Parser training:
      The Stanford Parser is trained on the related data set.
      2) Data processing:
      The sentences are input into the Stanford Parser, which outputs the corresponding sentence components; sentences whose components do not satisfy the subject + predicate structure are deleted directly. The results are then composed into a graph as described above; see FIG. 1.
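The filtering step can be sketched as below. The parser itself is not invoked; the sketch assumes its output has already been reduced to a dictionary of sentence components per sentence, and the keys `"subject"` and `"predicate"` as well as the function names are hypothetical.

```python
def has_subject_and_predicate(components):
    """True when both a subject and a predicate were found for the sentence."""
    return bool(components.get("subject")) and bool(components.get("predicate"))


def filter_sentences(parsed):
    """Keep only sentences that satisfy the subject + predicate structure."""
    return [(text, comp) for text, comp in parsed if has_subject_and_predicate(comp)]


# Hypothetical parser output: (sentence text, extracted components).
parsed = [
    ("a man kneels", {"subject": "man", "predicate": "kneels"}),
    ("very quickly", {"subject": None, "predicate": None}),
    ("runs away", {"subject": "", "predicate": "runs away"}),
]
kept = filter_sentences(parsed)
print([text for text, _ in kept])
```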
      3 video information composition:
      1) Faster-RCNN training:
      Faster-RCNN is trained on the related data set.
      2) Data processing:
      Target detection is performed on each frame, the image content is converted into subject + predicate form, and the classification results are then composed into a graph.
      4 similarity comparison:
      1) For the similarity comparison, all video resources are stored in a database to be searched, and the text is used as a single search input. Nodes with the same root subject and corresponding predicate are identified with each other.
      2) Of the two graphs, the graph formed from the text is preprocessed. For each node, the subject is classified into the same subject categories that are preset for the video resources, and the actions are processed in the same way, so that the two graphs are expressed in the same dimensions. Specifically, a convolutional layer extracts features of the words, and these features are compared with the features of each category. The category corresponding to the most similar feature map is selected as the classification result. (The similarity of feature maps is measured directly by the modulus of their difference.)
      3) In the preprocessed graphs, nodes that contain exactly the same information are regarded as the same node, the length of every edge between nodes is set to 1, and the distance between nodes separated by several edges is the accumulated edge length. (For example, if the path between two nodes passes through two edges, the distance is 2.)
      4) The whole similarity comparison problem can then be solved with an approximation algorithm for the quadratic programming problem. The comparison is applied between the text graph and each graph in the database, and the graph with the highest similarity is finally selected as the search result.
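Items 3) and 4) can be illustrated with a small sketch that computes unit-length hop distances by breadth-first search and scores two graphs by how well the distances between their shared nodes agree. The source does not specify the approximate quadratic-programming solver, so the scoring below is a simple stand-in chosen for illustration, not the patent's algorithm. Nodes are assumed to be identified by their category labels after preprocessing, and all names and example data are hypothetical.

```python
from collections import deque
from itertools import combinations


def to_adjacency(edges):
    """Undirected adjacency sets from a list of (node, node) edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def hop_distances(adjacency, source):
    """Breadth-first distances from source; every edge has length 1."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def graph_similarity(edges_a, edges_b):
    """Score in [0, 1]: node coverage times agreement of pairwise hop distances."""
    adj_a, adj_b = to_adjacency(edges_a), to_adjacency(edges_b)
    shared = sorted(set(adj_a) & set(adj_b))  # identically labeled nodes match
    if len(shared) < 2:
        return 0.0
    pairs = list(combinations(shared, 2))
    agreement = 0.0
    for u, v in pairs:
        da = hop_distances(adj_a, u).get(v)
        db = hop_distances(adj_b, u).get(v)
        if da is not None and db is not None:
            agreement += 1.0 / (1.0 + abs(da - db))
    coverage = len(shared) / max(len(adj_a), len(adj_b))
    return coverage * agreement / len(pairs)


def search(query_edges, database):
    """Return the key of the database graph most similar to the query graph."""
    return max(database, key=lambda vid: graph_similarity(query_edges, database[vid]))


# Hypothetical query graph and video graphs, with nodes named by category label.
query = [("man", "descend"), ("man", "amaze")]
database = {
    "video_a": [("man", "descend"), ("man", "amaze"), ("man", "walk")],
    "video_b": [("woman", "sing"), ("man", "walk")],
}
print(search(query, database))
```

A full treatment would replace `graph_similarity` with an approximate solver for the underlying quadratic assignment formulation; the unit-length hop distances computed here are the quantities such a solver would compare.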
      The above describes only specific embodiments of the present invention, but the scope of the present invention is not limited to them; any change or substitution that a person skilled in the art can readily conceive within the technical scope of the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims. It should be noted that the technical features described in the above embodiments can be combined in any suitable manner provided they do not contradict each other; to avoid unnecessary repetition, the possible combinations are not described separately. In addition, the various embodiments of the present invention can be combined in any way, and such combinations should likewise be regarded as disclosed by the present invention as long as they do not depart from its spirit.
    Claims (4)
1. A video searching method based on graph model comparison, characterized by comprising the following steps:
      step 1, composing a graph from the text obtained from the voice input:
      inputting the sentences into a Stanford Parser, directly deleting sentences whose output sentence components do not satisfy the subject + predicate structure, and composing a text information graph from the result;
      step 2, composing a graph from the video:
      carrying out target detection on each image frame, converting the image content into subject + predicate form, and then composing a video information graph from the classification result;
      step 3, comparing the similarity of the text information graph and the video information graph:
      extracting features of the words with a convolutional layer, comparing them with the features of each category, and selecting the category corresponding to the most similar feature map as the classification result;
      applying the similarity comparison between the text graph and each graph in the database, and finally selecting the graph with the highest similarity as the search result.
    2. The video searching method based on graph model comparison as claimed in claim 1, wherein the text information composition in step 1 includes training the Stanford Parser on a related data set.
    3. The video searching method based on graph model comparison as claimed in claim 1, wherein the video information composition in step 2 includes training Faster-RCNN on a related data set.
    4. The video searching method based on graph model comparison as claimed in claim 1, wherein in step 3 the similarity comparison specifically comprises:
      1) for the similarity comparison, all video resources are stored in a database to be searched, and the text is used as a single search input; nodes with the same subject and corresponding predicate are identified with each other;
      2) of the two graphs, the graph formed from the text is preprocessed; for each node, the subject is classified into the same subject categories that are preset for the video resources, and the actions are processed in the same way, so that the two graphs are expressed in the same dimensions; specifically, a convolutional layer extracts features of the words, these features are compared with the features of each category, and the category corresponding to the most similar feature map is selected as the classification result;
      3) in the preprocessed graphs, nodes that contain exactly the same information are regarded as the same node, the length of every edge between nodes is set to 1, and the distance between nodes separated by several edges is the accumulated edge length;
      4) the whole similarity comparison problem can then be solved with an approximation algorithm for the quadratic programming problem; the comparison is applied between the text graph and each graph in the database, and the graph with the highest similarity is finally selected as the search result.
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202011123040.7A CN112256913A (en) | 2020-10-19 | 2020-10-19 | Video searching method based on graph model comparison | 
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202011123040.7A CN112256913A (en) | 2020-10-19 | 2020-10-19 | Video searching method based on graph model comparison | 
Publications (1)
| Publication Number | Publication Date | 
|---|---|
| CN112256913A (en) | 2021-01-22 | 
Family
ID=74244443
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN202011123040.7A Pending CN112256913A (en) | 2020-10-19 | 2020-10-19 | Video searching method based on graph model comparison | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN112256913A (en) | 
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US20130343597A1 (en) * | 2012-06-26 | 2013-12-26 | Aol Inc. | Systems and methods for identifying electronic content using video graphs | 
| CN103440274A (en) * | 2013-08-07 | 2013-12-11 | 北京航空航天大学 | Video event sketch construction and matching method based on detail description | 
| CN103942337A (en) * | 2014-05-08 | 2014-07-23 | 北京航空航天大学 | Video search system based on image recognition and matching | 
| CN107273517A (en) * | 2017-06-21 | 2017-10-20 | 复旦大学 | Picture and text cross-module state search method based on the embedded study of figure | 
| CN110659392A (en) * | 2019-09-29 | 2020-01-07 | 北京市商汤科技开发有限公司 | Retrieval method and device, and storage medium | 
Similar Documents
| Publication | Title |
|---|---|
| CN111191078B (en) | Video information processing method and device based on video information processing model |
| CN110119786B (en) | Text topic classification method and device |
| US11151191B2 (en) | Video content segmentation and search |
| CN113590850A (en) | Multimedia data searching method, device, equipment and storage medium |
| CN111046225B (en) | Audio resource processing method, device, equipment and storage medium |
| CN109299277A (en) | Public opinion analysis method, server and computer-readable storage medium |
| US12106750B2 (en) | Multi-modal interface in a voice-activated network |
| US20230004830A1 (en) | AI-Based Cognitive Cloud Service |
| CN112163560A (en) | Video information processing method and device, electronic equipment and storage medium |
| CN114661872B (en) | A beginner-oriented API adaptive recommendation method and system |
| CN114048335B (en) | A user interaction method and device based on knowledge base |
| CN118779492B (en) | Multi-mode large model driven video understanding and searching method |
| CN116977701A (en) | Video classification model training method, video classification method and device |
| CN117312529A (en) | Information acquisition method, device, equipment, server, cluster and storage medium thereof |
| CN118503390A (en) | An automatic optimization method and system based on intelligent data memory |
| CN119441573B (en) | A news intelligent broadcasting method and system based on artificial intelligence |
| CN119539076B (en) | A real-time speech streaming and text dialogue interaction system based on a large language model |
| CN112861580A (en) | Video information processing method and device based on video information processing model |
| CN118170919B (en) | A method and system for classifying literary works |
| CN111949781B (en) | Intelligent interaction method and device based on natural sentence syntactic analysis |
| CN109063127A (en) | A kind of searching method, device, server and storage medium |
| CN119088929A (en) | A well engineering intelligent question-answering system and method |
| CN113204670A (en) | Attention model-based video abstract description generation method and device |
| CN112256913A (en) | Video searching method based on graph model comparison |
| CN117009170A (en) | Training sample generation method, device, equipment and storage medium |
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210122 |