CN106919652B - Short video automatic tagging method and system based on multi-source multi-view transductive learning - Google Patents
Short video automatic tagging method and system based on multi-source multi-view transductive learning
- Publication number
- CN106919652B (application CN201710051411.7A)
- Authority
- CN
- China
- Prior art keywords
- view
- short video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
The invention discloses a short video automatic tagging method based on multi-source multi-view transductive learning, comprising: acquiring short video data; preprocessing the short video data to generate image key frames, audio tracks, text, and semantic labels in a consistent format; extracting multi-view feature vectors from the image key frames, audio tracks, and text; building a short video annotation database in which the multi-view feature vectors and semantic labels are stored; computing similarities between the multi-view feature vectors; constructing a multi-view fusion subspace from those similarities; and solving the multi-view fusion subspace transductively so that semantic labels are automatically assigned to the short video data to be tagged. Also disclosed is a short video automatic tagging system based on multi-source multi-view transductive learning. The invention fully exploits the multi-source information attached to short video data and improves tagging accuracy.
Description
Technical Field
The present invention relates to the field of short video tagging, and in particular to a short video automatic tagging method and system based on multi-source multi-view transductive learning.
Background Art
With the development of mobile communication and Internet technology and the spread of smart terminals, short videos shot on mobile phones, tablets, and other devices and shared within social circles have become a popular form of social application. This form of short video socializing originated with the video sharing site Vine in 2013, whose mobile client limited recordings to 6 seconds. Short video applications such as Instagram, Meipai, WeChat, Weibo, Tencent Weishi, and Weikepai have developed rapidly in recent years, and the maximum duration of short videos has grown to 60 seconds. Short videos connect seamlessly with various Internet social platforms and are distributed to social networks immediately after shooting. They combine visual, auditory, textual, and user comment perspectives and satisfy users' needs for expression and communication more intuitively and vividly. Short videos therefore fit the trend of the mobile communication era and match users' habits of instant recording, quick editing, and rapid sharing. The information carried by dynamic short video interaction is more diversified and spreads topics more easily, and the commenting, following, and reposting mechanisms of short video sharing platforms keep attracting users and increasing user stickiness, making short video a new form of media.
Retrieving and managing large volumes of short video content is therefore of great importance. At the same time, from the perspective of network information management, analyzing, mining, and screening video content has become an urgent problem. Video is a composite medium integrating images, sound, and more; it carries a large amount of information, and retrieving it is inherently difficult. Many search engines can already handle text retrieval, but there is still no effective method for retrieving video data, chiefly because there is no effective way to index such multimedia data. Video tagging is the process of assigning text labels to a video: the labels describe the semantic content of the video and realize the mapping from low-level video features to semantic concepts, and the resulting labels serve as metadata for efficient indexing. Video retrieval and analysis are thus reduced to conventional keyword search and mining.
Existing patents focus mainly on traditional long video sequences and perform tagging only through single-view analysis of visual content. For example, Chinese invention patent application No. 201511021303.2, "Video labeling method and device," uses the frequency of object occurrences in video frames to set identification information for each video segment. Chinese invention patent application No. 201010134073.1, "A video labeling method," performs video tagging through single-view similarity computation on visual color features, without using audio or text information. Chinese invention patent application No. 201511005264.7, "A video labeling method and video labeling device," proposes a video labeling method based on image comparison, which superimposes a sharper reference video image on the image to be labeled from the same scene to highlight sensitive target positions. Chinese invention patent application No. 201511005264.7, "A video search method and device," determines video frame labels by predicting, merging, and labeling them, but uses only single-view visual features. Chinese invention patent application No. 201210075050.7, "Video semantic annotation method based on bag-of-features model and supervised learning," models a vocabulary ontology and visual words and uses a support vector machine to assign visual word labels, but relies only on SIFT features.
The tagging methods proposed in the existing literature are likewise limited to single-view analysis of visual content [1]. Reference [2] feeds SIFT visual word quantization features of video frames into a support vector machine to classify and label videos. Reference [3] uses a support vector machine to classify and label visual features such as HSI color and shape features. Reference [4] extracts color and texture feature vectors from sample key frames and trains them with a meta-learning strategy to form a visual dictionary. These methods perform semantic tagging through visual analysis alone, lack the ability to fuse multi-source, multi-view features, and are unsuitable for short video media that carry multi-source, multi-view descriptions.
Unlike traditional video tagging, tagging of online short videos has its own characteristics. Because the videos are short, no shot segmentation is needed; because they are shared quickly, they carry no text subtitles; and because they have social attributes, they come with multi-source information such as content descriptions and user comments, which is highly valuable for tagging short video content.
[1] Yin Wenjie, Han Junwei, Guo Lei. Recent progress in automatic image and video annotation [J]. Computer Science, 2011, 38(12): 12-16.
[2] Wang Han, Wu Xinxiao, Jia Yunde. Video annotation using heterogeneous Internet image groups [J]. Chinese Journal of Computers, 2013, 36(10): 2062-2069.
[3] Zhang Jianming, Sun Chunmei, Yan Ting. Semi-supervised active learning video annotation based on adaptive SVM [J]. Computer Engineering, 2013, 39(8): 190-195.
[4] Cui Tong, Xu Xin. A semantic-analysis-based big data video annotation method [J]. Journal of Nanjing University of Aeronautics & Astronautics, 2016, 48(5).
Summary of the Invention
In view of this, the present invention provides a short video automatic tagging method and system based on multi-source multi-view transductive learning.
In a first aspect, the present invention provides a short video automatic tagging method based on multi-source multi-view transductive learning, comprising:
acquiring short video data;
preprocessing the short video data to generate image key frames, audio tracks, text, and semantic labels in a consistent format;
extracting multi-view feature vectors from the image key frames, audio tracks, and text;
building a short video annotation database, in which the multi-view feature vectors and the semantic labels are stored;
computing similarities between the multi-view feature vectors;
constructing a multi-view fusion subspace from the similarities of the multi-view feature vectors;
transductively solving the multi-view fusion subspace and automatically assigning the semantic labels to the short video data to be tagged;
wherein the short video data to be tagged is short video data that carries no semantic label after the preprocessing;
wherein the short video data is multi-source information composed of visual images, audio, content descriptions, and user comments.
Preferably, the computation of the multi-view feature vector similarities comprises:
computing the distance $d_{ij}^k$ between the multi-view feature vectors;
computing the mean value $\sigma_k$ of each view of the multi-view feature vectors;
using the distance between the multi-view feature vectors and the per-view mean $\sigma_k$, computing the similarity between the multi-view feature vectors in the interval from 0 to 1 with a Gaussian kernel function, the similarities of the k-th view forming the similarity matrix $S^k$ of the k-th view;
the element in row i, column j of the similarity matrix $S^k$ of the k-th view being computed as

$$S_{ij}^k = \exp\!\left(-\frac{d_{ij}^k}{2\sigma_k^2}\right)$$

where $X=\{x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{N+M}\}$ is the sample set of the short video data, the number of short videos carrying the semantic labels is N, the number of short video samples to be tagged is M, the sample set is described from K views, $x_i^k$ is the feature vector of the k-th view of the i-th video, $x_j^k$ is the feature vector of the k-th view of the j-th video, and $1 \le k \le K$, $1 \le i \le N+M$, $1 \le j \le N+M$.
Preferably, the method for constructing the multi-view fusion subspace comprises:
computing the normalized Laplacian similarity matrix $L(S^k)$ of the similarity matrix $S^k$ of the k-th view, $1 \le k \le K$:

$$L(S^k) = I - (D^k)^{-1/2}\, S^k\, (D^k)^{-1/2}$$

where I is the identity matrix of order $(N+M)\times(N+M)$ and $D^k$ is the degree matrix of the similarity matrix $S^k$ of the k-th view, i.e., its main diagonal elements are the column sums of $S^k$ and its off-diagonal elements are 0;
computing the trace-normalized matrix of the normalized Laplacian similarity matrix $L(S^k)$:

$$\hat{L}(S^k) = \frac{L(S^k)}{\operatorname{tr}\!\left(L(S^k)\right)}$$

trace normalization of the multi-view fusion subspace similarity matrix $S^0$:

$$S^0 = \sum_{k=1}^{K} \beta_k \hat{L}(S^k)$$

where $\beta = (\beta_1, \ldots, \beta_K)^T$ are the randomly initialized linear combination coefficients of the K views.
Preferably, the method for transductively solving the multi-view fusion subspace comprises:
constructing an objective function whose minimization yields the predicted label matrix f;
where $Y \in \{0,1\}^{N \times C}$ is the label set formed by the semantic labels: if sample $x_i$ of the short video data has the j-th label then $Y_{ij} = 1$, otherwise $Y_{ij} = 0$, and C is the size of the label set;
where $f \in \mathbb{R}^{(N+M) \times C}$ is the randomly initialized predicted label matrix;
first, the objective function is transformed by the Lagrange multiplier method into a Lagrangian Γ;
then, with f fixed, β is solved for: the partial derivative of Γ with respect to β is taken and set to 0, giving the linear combination coefficients β of the K views;
finally, with β fixed, f is solved for: the partial derivatives of Γ with respect to $f_i$ ($1 \le i \le N$) and with respect to $f_i$ ($N+1 \le i \le N+M$) are taken and set to 0, giving the predicted label matrix f;
the steps "fix f, solve for β" and "fix β, solve for f" are repeated alternately until the predicted label matrix f converges, yielding the linear combination coefficients β of the fusion subspace and the predicted label matrix f and completing the tagging of the short video data to be tagged;
wherein $f_{\cdot t}$ denotes the t-th column of f;
wherein δ is a non-negative Lagrange multiplier.
Preferably, the parameters λ, μ, and θ of the objective function are determined by 10-fold cross-validation.
In a second aspect, the present invention provides a short video automatic tagging system based on multi-source multi-view transductive learning, comprising:
a short video data acquisition unit for acquiring short video data;
a short video data preprocessing unit, connected to the short video data acquisition unit and to a short video annotation database unit, for preprocessing the short video data to generate image key frames, audio tracks, text, and semantic labels in a consistent format, the semantic labels being stored in the database;
a multi-view feature vector extraction unit, connected to the short video data preprocessing unit and to the short video annotation database unit, for extracting multi-view feature vectors from the image key frames, audio tracks, and text;
the short video annotation database unit, for storing the multi-view feature vectors and the semantic labels;
a multi-view feature vector similarity computation unit, connected to the short video annotation database unit, for computing similarities between the multi-view feature vectors;
a multi-view fusion subspace transductive solving unit, connected to the multi-view feature vector similarity computation unit, which constructs a multi-view fusion subspace from the multi-view feature vector similarities, solves the multi-view fusion subspace transductively, and automatically assigns the semantic labels to the short video data to be tagged;
wherein the short video data to be tagged is short video data that carries no semantic label after the preprocessing;
wherein the short video data is multi-source information composed of visual images, audio, content descriptions, and user comments.
Preferably, the multi-view feature vector similarity computation unit comprises:
a distance computation module for computing the distance $d_{ij}^k$ between the multi-view feature vectors;
a per-view mean computation module for computing the mean value $\sigma_k$ of each view of the multi-view feature vectors;
a multi-view feature vector similarity module, connected to the distance computation module and to the per-view mean computation module, which computes the similarity between the multi-view feature vectors in the interval from 0 to 1 with a Gaussian kernel function, the similarities of the k-th view forming the similarity matrix $S^k$ of the k-th view;
the element in row i, column j of the similarity matrix $S^k$ of the k-th view being computed as

$$S_{ij}^k = \exp\!\left(-\frac{d_{ij}^k}{2\sigma_k^2}\right)$$

where $X=\{x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{N+M}\}$ is the sample set of the short video data, the number of short videos carrying the semantic labels is N, the number of short video samples to be tagged is M, the sample set is described from K views, $x_i^k$ is the feature vector of the k-th view of the i-th video, $x_j^k$ is the feature vector of the k-th view of the j-th video, and $1 \le k \le K$, $1 \le i \le N+M$, $1 \le j \le N+M$.
Preferably, the multi-view fusion subspace transductive solving unit comprises:
a multi-view fusion subspace construction module, which constructs the multi-view fusion subspace as follows:
the normalized Laplacian similarity matrix $L(S^k)$ of the similarity matrix $S^k$ of the k-th view is computed, $1 \le k \le K$:

$$L(S^k) = I - (D^k)^{-1/2}\, S^k\, (D^k)^{-1/2}$$

where I is the identity matrix of order $(N+M)\times(N+M)$ and $D^k$ is the degree matrix of the similarity matrix $S^k$ of the k-th view, i.e., its main diagonal elements are the column sums of $S^k$ and its off-diagonal elements are 0;
the trace-normalized matrix of the normalized Laplacian similarity matrix $L(S^k)$ is computed:

$$\hat{L}(S^k) = \frac{L(S^k)}{\operatorname{tr}\!\left(L(S^k)\right)}$$

and the trace-normalized multi-view fusion subspace similarity matrix $S^0$ is formed:

$$S^0 = \sum_{k=1}^{K} \beta_k \hat{L}(S^k)$$

where $\beta = (\beta_1, \ldots, \beta_K)^T$ are the randomly initialized linear combination coefficients of the K views.
Preferably, the multi-view fusion subspace transductive solving unit further comprises:
a multi-view fusion subspace solving module, connected to the multi-view fusion subspace construction module, which solves the multi-view fusion subspace as follows:
an objective function is constructed whose minimization yields the predicted label matrix f;
where $Y \in \{0,1\}^{N \times C}$ is the label set formed by the semantic labels: if sample $x_i$ of the short video data has the j-th label then $Y_{ij} = 1$, otherwise $Y_{ij} = 0$, and C is the size of the label set;
where $f \in \mathbb{R}^{(N+M) \times C}$ is the randomly initialized predicted label matrix;
first, the objective function is transformed by the Lagrange multiplier method into a Lagrangian Γ;
then, with f fixed, β is solved for: the partial derivative of Γ with respect to β is taken and set to 0, giving the linear combination coefficients β of the K views;
finally, with β fixed, f is solved for: the partial derivatives of Γ with respect to $f_i$ ($1 \le i \le N$) and with respect to $f_i$ ($N+1 \le i \le N+M$) are taken and set to 0, giving the predicted label matrix f;
the steps "fix f, solve for β" and "fix β, solve for f" are repeated alternately until the predicted label matrix f converges, yielding the linear combination coefficients β of the fusion subspace and the predicted label matrix f and completing the tagging of the short video data to be tagged;
wherein $f_{\cdot t}$ denotes the t-th column of f;
wherein δ is a non-negative Lagrange multiplier.
Preferably, the parameters λ, μ, and θ of the objective function are determined by 10-fold cross-validation.
The present invention has at least the following beneficial effects:
The present invention takes into account the multi-source information attached to short video data, acquires short video data both with and without semantic labels, preprocesses the short video data to generate image key frames, audio tracks, text, and semantic labels in a consistent format, and automatically tags the short video data to be tagged that carries no semantic label, improving tagging accuracy.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present invention will become clearer from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
Fig. 1 is a flow chart of a short video automatic tagging method based on multi-source multi-view transductive learning according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the structure of the short video annotation database according to an embodiment of the present invention;
Fig. 3 is a functional block diagram of a short video automatic tagging system based on multi-source multi-view transductive learning according to an embodiment of the present invention.
Detailed Description of the Embodiments
The present invention is described below on the basis of embodiments, but it should be noted that the present invention is not limited to these embodiments. In the following detailed description, some specific details are described exhaustively; however, a person skilled in the art can fully understand the present invention even for the parts not described in detail.
In addition, a person of ordinary skill in the art should understand that the accompanying drawings are provided only to illustrate the objects, features, and advantages of the present invention and are not necessarily drawn to scale.
Also, unless the context clearly requires otherwise, the words "comprise," "include," and similar words throughout the specification and claims should be interpreted in an inclusive rather than an exclusive or exhaustive sense, that is, in the sense of "including but not limited to."
Fig. 1 is a flow chart of a short video automatic tagging method based on multi-source multi-view transductive learning according to an embodiment of the present invention. As shown in Fig. 1, the method comprises step 101 of acquiring short video data, step 102 of short video data preprocessing, step 103 of multi-view feature vector extraction, step 104 of building the short video annotation database, step 105 of computing multi-view feature vector similarities, and step 106 of transductively solving the multi-view fusion subspace.
Step 101, acquiring short video data: training samples (i.e., short video data) are downloaded from short video resource websites, and the extracted information is stored in buffer files. The implementation first initializes the website links and access credentials, initializes the query tag queue, initializes the number of threads, generates a download queue, and cyclically downloads videos, comments, and descriptions, writing them as streams into buffer files. The above is a concrete implementation of the data crawling engine in Python; the present invention does not limit the manner in which short video data is acquired.
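As a concrete illustration, a minimal sketch of such a crawling engine is given below, assuming a generic HTTP API; the site URL, endpoint layout, and JSON field names are hypothetical and stand in for whatever the target short video site actually exposes.

```python
# Minimal sketch of the crawling engine described above (Python).
# The URL, endpoints, and JSON fields below are hypothetical; only the overall
# structure (tag queue, worker threads, buffered stream writes) follows the text.
import json
import queue
import threading
import requests

TAG_QUEUE = queue.Queue()
for tag in ["travel", "food", "sports"]:   # query tags, pre-registered in the database
    TAG_QUEUE.put(tag)

def worker() -> None:
    while True:
        try:
            tag = TAG_QUEUE.get_nowait()
        except queue.Empty:
            return
        # Hypothetical endpoint returning video URL, description, and comments per tag.
        resp = requests.get("https://example.com/api/videos",
                            params={"tag": tag}, timeout=30)
        for item in resp.json().get("videos", []):
            video = requests.get(item["video_url"], stream=True, timeout=60)
            with open(f"buffer/{item['video_id']}.mp4", "wb") as fh:
                for chunk in video.iter_content(chunk_size=1 << 16):  # stream to buffer file
                    fh.write(chunk)
            with open(f"buffer/{item['video_id']}.json", "w", encoding="utf-8") as fh:
                json.dump({"description": item["description"],
                           "comments": item["comments"]}, fh, ensure_ascii=False)

threads = [threading.Thread(target=worker) for _ in range(8)]  # initialized thread count
for t in threads:
    t.start()
for t in threads:
    t.join()
```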
The crawled training samples are multi-source information such as short video files, content descriptions, and user comments.
Step 102, short video data preprocessing: this step performs the video, audio, and text processing of the training samples (i.e., the short video data) and of the short video data to be tagged, generating image key frames, audio tracks, and semantic labels respectively. The image key frames and audio tracks proceed to step 103 for multi-view feature vector extraction, while the semantic labels go directly to step 104 and are stored in the short video annotation database.
Among the short video data there are short videos without semantic labels; through step 103 (multi-view feature vector extraction), step 104 (building the short video annotation database), step 105 (computing multi-view feature vector similarities), and step 106 (transductively solving the multi-view fusion subspace), semantic labels are assigned to the short video data without semantic labels, completing their automatic tagging.
Further, for each short video file, every 20 frames are taken as a unit and the 10th frame is extracted as a key frame; the generated key frame sequence is converted to a uniform size, e.g. 640×360, and stored in the buffer file with the CV_32FC3 data structure.
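A minimal sketch of this key frame extraction, using OpenCV's Python bindings; the file names are illustrative, while the 20-frame unit, the choice of the 10th frame, the 640×360 size, and the CV_32FC3 (3-channel float32) storage format follow the text above.

```python
# Key frame extraction sketch: the 10th frame of every 20-frame unit,
# resized to 640x360 and stored as float32 (CV_32FC3).
import cv2
import numpy as np

def extract_keyframes(video_path: str) -> list[np.ndarray]:
    cap = cv2.VideoCapture(video_path)
    keyframes, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % 20 == 10:  # the 10th frame of each 20-frame unit
            frame = cv2.resize(frame, (640, 360))
            keyframes.append(frame.astype(np.float32))  # CV_32FC3 layout
        index += 1
    cap.release()
    return keyframes

frames = extract_keyframes("buffer/example.mp4")
if frames:
    np.save("buffer/example_keyframes.npy", np.stack(frames))
```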
Further, the audio track is extracted from each short video file using the FFmpeg tool, with a fixed 16 kHz sampling rate and 16-bit quantization. The obtained single-channel audio is stored in the buffer file in pulse code modulation (PCM) format.
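A sketch of this audio extraction step, invoking FFmpeg through Python's subprocess; the paths are illustrative, and the 16 kHz / 16-bit / mono PCM settings follow the text.

```python
# Extract the audio track as raw 16 kHz, 16-bit, single-channel PCM via FFmpeg.
import subprocess

def extract_audio(video_path: str, pcm_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",                  # drop the video stream
        "-ac", "1",             # single channel (mono)
        "-ar", "16000",         # 16 kHz sampling rate
        "-f", "s16le",          # raw PCM container, 16-bit little-endian
        "-acodec", "pcm_s16le",
        pcm_path,
    ], check=True)

extract_audio("buffer/example.mp4", "buffer/example.pcm")
```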
Further, the video descriptions and comments are tokenized and stemmed with Stanford's NLP tools, stop words are filtered out, and after filtering with the WordNet dictionary the initial semantic labels are generated.
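The following sketch illustrates this text pipeline. The patent names Stanford's NLP tools; the sketch substitutes NLTK for tokenization, stemming, stop-word removal, and the WordNet filter, so it shows the shape of the pipeline rather than the exact toolchain.

```python
# Text preprocessing sketch: tokenize, drop stop words, keep WordNet words, stem.
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

def initial_labels(text: str) -> list[str]:
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in stop]        # stop-word filtering
    tokens = [t for t in tokens if wordnet.synsets(t)]   # WordNet dictionary filter
    return sorted({stemmer.stem(t) for t in tokens})     # stemmed, deduplicated labels

print(initial_labels("A beautiful sunset timelapse filmed at the beach"))
```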
Still further, the short video data is preprocessed to generate image key frames, audio tracks, and semantic labels; the semantic labels go directly to step 104 and are stored in the short video annotation database.
Step 103, multi-view feature vector extraction: this step extracts the visual view features, audio view features, and text view features of the training set short videos and of the short video data to be tagged. The specific extraction is as follows.
For each image key frame the following visual view features are extracted: (1) 225-dimensional color moment features over a 5×5 grid of blocks; (2) a 144-dimensional HSV correlogram; (3) 128-dimensional wavelet texture; (4) a 64-dimensional HSV histogram; (5) a 75-dimensional edge distribution histogram; (6) 16-dimensional co-occurrence texture features; (7) 1000-dimensional softmax-layer object features extracted with the method of "Caffe: Convolutional architecture for fast feature embedding" (Jia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675-678. ACM, 2014); (8) 2089-dimensional sentiment features of the key frames extracted with the method of "Large-scale visual sentiment ontology and detectors using adjective noun pairs" (Borth, Damian, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. In Proceedings of the 21st ACM International Conference on Multimedia, pp. 223-232. ACM, 2013), using the sentiment feature model pre-trained on the SentiBank dataset.
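As an illustration of feature (1), a sketch of the 225-dimensional color moment extraction: each of the 25 blocks of a 5×5 grid contributes mean, standard deviation, and skewness per color channel (25 × 3 × 3 = 225). The block layout here is the standard color moment construction; the patent does not spell out these details.

```python
# 225-dimensional color moments over a 5x5 block grid (standard construction).
import numpy as np

def color_moments_5x5(frame: np.ndarray) -> np.ndarray:
    """frame: HxWx3 float32 key frame (e.g. 360x640x3)."""
    h, w, _ = frame.shape
    feats = []
    for bi in range(5):
        for bj in range(5):
            block = frame[bi * h // 5:(bi + 1) * h // 5,
                          bj * w // 5:(bj + 1) * w // 5].reshape(-1, 3)
            mean = block.mean(axis=0)
            std = block.std(axis=0)
            skew = np.cbrt(((block - mean) ** 3).mean(axis=0))  # cube root of 3rd moment
            feats.extend([mean, std, skew])
    return np.concatenate(feats)  # 25 blocks * 3 moments * 3 channels = 225 dims

vec = color_moments_5x5(np.random.rand(360, 640, 3).astype(np.float32))
assert vec.shape == (225,)
```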
The tool Sentence2vec is used to extract 100-dimensional text description features.
Audio view feature extraction: first, the single-channel audio of the audio track is sampled with the FFmpeg tool at a 16 kHz sampling frequency in 20-millisecond intervals; a Hamming window is then applied to the audio signal, the 0-8 kHz frequency band is divided into 7 sub-bands, and 14-dimensional Mel cepstral coefficient features are extracted with the method of "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition" (Sahidullah, Md, and Goutam Saha. Speech Communication 54, no. 4, 2012: 543-565), together with 276-dimensional audio energy features, 170-dimensional audio spectral centroid features, 169-dimensional spectral-temporal features, and 80-dimensional audio signature features.
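A sketch of the MFCC portion of the audio view, using librosa as a stand-in for the cited MFCC formulation; 20 ms frames at 16 kHz give 320-sample windows, and the Hamming window, 14 coefficients, and 0-8 kHz band follow the text, while the remaining parameters are library defaults.

```python
# 14-dimensional MFCC sketch over 20 ms Hamming-windowed frames at 16 kHz.
import numpy as np
import librosa

def mfcc_14(pcm_path: str, sr: int = 16000) -> np.ndarray:
    # Raw 16-bit little-endian mono PCM, as produced by the FFmpeg step above.
    pcm = np.fromfile(pcm_path, dtype=np.int16).astype(np.float32) / 32768.0
    return librosa.feature.mfcc(
        y=pcm, sr=sr, n_mfcc=14,
        n_fft=320, hop_length=320,     # non-overlapping 20 ms frames
        window="hamming", fmax=8000,   # Hamming window, 0-8 kHz band
    )  # shape: (14, n_frames)

mfcc = mfcc_14("buffer/example.pcm")
```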
Further, the multi-view feature vectors obtained in step 103 are stored in the short video annotation database.
Step 104, building the short video annotation database: the short video annotation database has six tables: the user information table SeedUserTable, the feature coding table ViewCodeTable, the video file table VideoDescriptionTable, the multi-view feature table MultiviewTable, the tag vocabulary table TagTable, and the video tag table VideoTagTable; see the description of Fig. 2 for details.
Further, the database is implemented with the MongoDB NoSQL distributed storage database.
Further, the contents of SeedUserTable, KeyWordsForDownloadTable, ViewCodeTable, and TagTable must be entered in advance and are called by the data crawling engine.
Still further, the semantic labels generated by the short video data preprocessing of step 102 and the visual features, audio track view features, and text description features extracted in step 103 are stored in the multi-view feature table MultiviewTable of the short video annotation database.
Step 105, computing multi-view feature vector similarities: based on the extracted multi-view features, the similarity matrix of each view is computed and Laplacian normalization is applied; the optimization objective is then determined and solved transductively.
For ease of description, the following notation is used:
The sample set of short video data is $X = \{x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{N+M}\}$, where the number of samples carrying semantic labels in the short video annotation database (the training set samples) is N, the number of test samples of short video data to be tagged is M, and the sample set is described from K views.
$Y \in \{0,1\}^{N \times C}$ is the label set formed by the semantic labels (the training set label matrix): if $x_i$ has the j-th label then $Y_{ij} = 1$, otherwise $Y_{ij} = 0$; C is the size of the label set.
$f \in \mathbb{R}^{(N+M) \times C}$ is the randomly initialized predicted label matrix.
The K views are the visual features, audio features, and text description features extracted in step 103.
The similarity between feature vectors is computed for each view: first the distances between feature vectors are obtained, then the mean value is computed, and the Gaussian kernel similarity of the feature vectors in the interval from 0 to 1 is computed; the similarity matrix of the corresponding neighbor graph is then constructed for each view.
The distance between multi-view feature vectors is computed as

$$d_{ij}^k = \sum_{l=1}^{m_k} \left| \frac{x_{i,l}^k}{\lVert x_i^k \rVert_1} - \frac{x_{j,l}^k}{\lVert x_j^k \rVert_1} \right|$$

where $x_i^k$ is the feature vector of the k-th view of the i-th video and $x_{i,l}^k$ is its l-th component; $x_j^k$ is the feature vector of the k-th view of the j-th video and $x_{j,l}^k$ is its l-th component; $m_k$ is the length of the feature vector of the k-th view; and $\lVert x_i^k \rVert_1$ and $\lVert x_j^k \rVert_1$ are the 1-norms of $x_i^k$ and $x_j^k$.
The mean value $\sigma_k$ of the feature vectors of the k-th view is then computed; it serves as the bandwidth of the Gaussian kernel below.
The element in row i, column j of the similarity matrix $S^k$ of the k-th view is then computed as

$$S_{ij}^k = \exp\!\left(-\frac{d_{ij}^k}{2\sigma_k^2}\right)$$
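Putting step 105 together, a sketch of the per-view similarity computation as reconstructed above; the 1-norm-normalized distance, mean bandwidth, and Gaussian kernel forms are reconstructions of the original (garbled) formulas, not verbatim reproductions.

```python
# Per-view similarity matrix sketch (reconstruction of step 105).
import numpy as np

def view_similarity(Xk: np.ndarray) -> np.ndarray:
    """Xk: (N+M, m_k) feature matrix of one view; returns the (N+M, N+M) matrix S^k."""
    norm = np.abs(Xk).sum(axis=1, keepdims=True)           # 1-norms of the row vectors
    Z = Xk / np.maximum(norm, 1e-12)                       # 1-norm-normalized vectors
    D = np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=2)  # pairwise distances d_ij^k
    sigma = D.mean()       # assumption: sigma_k taken as the mean pairwise distance
    return np.exp(-D / (2.0 * sigma ** 2))                 # Gaussian kernel, values in (0, 1]

S_list = [view_similarity(np.random.rand(20, 64)) for _ in range(3)]  # K = 3 toy views
```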
Step 106, transductively solving the multi-view fusion subspace: first the normalized Laplacian similarity matrices of the K views are computed; the prediction matrix f and the combination coefficients β of the K views are then randomly initialized, and an alternating optimization strategy is used to solve for f and β until the objective function converges.
The normalized Laplacian similarity matrix $L(S^k)$ of the k-th view is computed as

$$L(S^k) = I - (D^k)^{-1/2}\, S^k\, (D^k)^{-1/2}$$

where I is the identity matrix of order $(N+M)\times(N+M)$ and $D^k$ is the degree matrix of the similarity matrix $S^k$ of the k-th view, i.e., its main diagonal elements are the column sums of $S^k$ and its off-diagonal elements are 0.
Based on the computed Laplacian similarity matrices, the fusion subspace is computed and the labels of the M test samples are derived. The specific steps are as follows: the trace-normalized matrix of the k-th view is computed from the Laplacian similarity matrix as

$$\hat{L}(S^k) = \frac{L(S^k)}{\operatorname{tr}\!\left(L(S^k)\right)}$$

and the trace-normalized multi-view fusion subspace similarity matrix $S^0$ is expressed as

$$S^0 = \sum_{k=1}^{K} \beta_k \hat{L}(S^k)$$
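A sketch of this fusion subspace construction; the normalized Laplacian and trace-normalization forms are reconstructions consistent with the surrounding text.

```python
# Fusion subspace sketch: normalized Laplacian per view, trace normalization,
# and the beta-weighted combination S^0.
import numpy as np

def normalized_laplacian(Sk: np.ndarray) -> np.ndarray:
    d = Sk.sum(axis=0)                           # degree: column sums of S^k
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(len(Sk)) - (d_inv_sqrt[:, None] * Sk * d_inv_sqrt[None, :])

def fusion_subspace(S_list: list[np.ndarray], beta: np.ndarray) -> np.ndarray:
    L_hat = []
    for Sk in S_list:
        L = normalized_laplacian(Sk)
        L_hat.append(L / np.trace(L))            # trace normalization
    return sum(b * L for b, L in zip(beta, L_hat))  # S^0 = sum_k beta_k * L_hat(S^k)

S_list = [np.random.rand(20, 20) for _ in range(3)]
S_list = [(Sk + Sk.T) / 2 for Sk in S_list]      # symmetrize toy similarities
beta = np.full(len(S_list), 1.0 / len(S_list))   # initialization; entries sum to 1
S0 = fusion_subspace(S_list, beta)
```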
where $\beta = (\beta_1, \ldots, \beta_K)^T$ are the randomly initialized linear combination coefficients. To obtain the predicted label matrix f, an objective function is minimized. Its first term constrains the predicted labels of the training set videos to agree with their original labels; its second term constrains the multi-view fusion subspace and the original space to be isometric; its third term constrains samples with high similarity to have similar labels; and its fourth term is a sparsity constraint on the combination coefficients. To prevent overfitting, the parameters of the objective function are determined by 10-fold cross-validation; in this embodiment they are set to λ = 1, μ = 0.01, θ = 100.
The prediction matrix f and the combination coefficients β are randomly initialized, and an alternating optimization strategy is used to solve for f and β until the objective function converges. The solution steps are as follows.
First, f is fixed and β is solved for.
With f fixed, the objective function reduces to a quadratic function of β involving a matrix H and a vector g, where $f_{\cdot t}$ denotes the t-th column of f. Incorporating the constraint $e^T\beta = 1$ through a non-negative Lagrange multiplier δ yields a Lagrangian Γ. Taking the partial derivative of Γ with respect to β, where H is positive definite and I is the K×K identity matrix, and setting it to 0 gives $\beta = H^{-1}(\delta e - \mu g)$; substituting this expression into $e^T\beta = 1$ gives the solution formula for β:

$$\beta = H^{-1}\!\left(\frac{1 + \mu\, e^{T} H^{-1} g}{e^{T} H^{-1} e}\, e - \mu g\right)$$
Because H is positive definite, $H^{-1}$ is also a positive definite matrix, and therefore $e^{T} H^{-1} e > 0$.
Then β is fixed and f is solved for; the specific steps are as follows:
The partial derivatives of Γ with respect to $f_i$ ($1 \le i \le N$) and with respect to $f_i$ ($N+1 \le i \le N+M$) are taken and set to 0, which yields closed-form update expressions for the rows of f corresponding to the labeled samples and to the samples to be tagged, respectively.
The steps "fix f, solve for β" and "fix β, solve for f" are repeated alternately until the predicted label matrix f converges, giving the linear combination coefficients β of the fusion subspace and the predicted label matrix f, which completes the tagging of the short video data to be tagged.
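A sketch of the β update derived above: $\beta = H^{-1}(\delta e - \mu g)$ with δ chosen so that $e^T\beta = 1$. H and g are taken as inputs here, since their exact construction is given by formulas not reproduced in this text; only the constrained solve itself is shown.

```python
# Beta update sketch: beta = H^{-1}(delta*e - mu*g), with delta fixed by e^T beta = 1.
# H and g must be assembled elsewhere from the trace-normalized Laplacians and
# the current f; their construction is not reproduced in this text.
import numpy as np

def update_beta(H: np.ndarray, g: np.ndarray, mu: float) -> np.ndarray:
    """H: (K, K) positive definite; g: (K,). Returns beta with e^T beta = 1."""
    e = np.ones(len(g))
    Hinv_e = np.linalg.solve(H, e)                    # H^{-1} e
    Hinv_g = np.linalg.solve(H, g)                    # H^{-1} g
    delta = (1.0 + mu * e @ Hinv_g) / (e @ Hinv_e)    # from the constraint e^T beta = 1
    return delta * Hinv_e - mu * Hinv_g               # beta = H^{-1}(delta*e - mu*g)

# Toy check with a random positive definite H.
A = np.random.rand(3, 3)
beta = update_beta(A @ A.T + 3 * np.eye(3), np.random.rand(3), mu=0.01)
assert abs(beta.sum() - 1.0) < 1e-9
```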
Fig. 2 is a schematic diagram of the structure of the short video annotation database according to an embodiment of the present invention. As shown in Fig. 2, the short video annotation database has six tables: the user information table SeedUserTable, the feature coding table ViewCodeTable, the video file table VideoDescriptionTable, the multi-view feature table MultiviewTable, the tag vocabulary table TagTable, and the video tag table VideoTagTable. The video file table VideoDescriptionTable contains the video identifier VideoID, the network link identifier WebSiteID, the video description VideoDescription, and the video storage path VideoPath, with VideoID as primary key (PK). The feature coding table ViewCodeTable contains the view identifier ViewID and the feature name ViewName, with ViewID as primary key. The multi-view feature table MultiViewTable contains the video identifier VideoID, the view identifier ViewID, and the multi-view feature vector ViewValue, with VideoID as primary key. The video tag table VideoTagTable stores the tag vectors of short videos and contains the video identifier VideoID and the tag vector TagVector, with VideoID as primary key. The tag vocabulary table TagTable contains the tag identifier TagID and the tag text TagText, with TagID as primary key. The user information table SeedUserTable contains the network link identifier WebSiteID, the website address WebSite, the login name Username, and the password Password, with WebSiteID as primary key.
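To make the schema concrete, a sketch of how these tables might be laid out as MongoDB collections, matching the MongoDB NoSQL choice stated earlier; the document shapes are illustrative, with field names taken from the table descriptions above.

```python
# Illustrative MongoDB layout for three of the six tables described above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["short_video_annotation"]

db["VideoDescriptionTable"].insert_one({
    "VideoID": "v0001",                # primary key
    "WebSiteID": "w01",
    "VideoDescription": "sunset at the beach",
    "VideoPath": "buffer/v0001.mp4",
})
db["MultiViewTable"].insert_one({
    "VideoID": "v0001",
    "ViewID": "color_moments",         # one document per (video, view) pair
    "ViewValue": [0.41, 0.22, 0.09],   # truncated feature vector, illustrative
})
db["VideoTagTable"].insert_one({
    "VideoID": "v0001",
    "TagVector": [0, 1, 0, 0, 1],      # one slot per entry of TagTable
})
```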
Fig. 3 is a functional block diagram of a short video automatic tagging system based on multi-source multi-view transductive learning according to an embodiment of the present invention. The system comprises: a short video data acquisition unit 301, a short video data preprocessing unit 302, a multi-view feature vector extraction unit 303, a short video annotation database unit 304, a multi-view feature vector similarity computation unit 305, and a multi-view fusion subspace transductive solving unit 306.
In Fig. 3, the short video data acquisition unit 301 is used to acquire short video data; for its acquisition method see the description of step 101 in Fig. 1.
In Fig. 3, the short video data preprocessing unit 302 is connected to the short video data acquisition unit 301 and to the short video annotation database unit 304 and is used to preprocess the short video data to generate image key frames, audio tracks, text, and semantic labels in a consistent format, the semantic labels being stored in the database; for its preprocessing method see the description of step 102 in Fig. 1.
In Fig. 3, the multi-view feature vector extraction unit 303 is connected to the short video data preprocessing unit 302 and to the short video annotation database unit 304 and is used to extract the multi-view feature vectors of the image key frames, audio tracks, and text; for the extraction method see the description of step 103 in Fig. 1.
In Fig. 3, the short video annotation database unit 304 is used to store the multi-view feature vectors and the semantic labels; for its database structure see the description of Fig. 2.
In Fig. 3, the multi-view feature vector similarity computation unit 305 is connected to the short video annotation database unit 304 and is used to compute similarities between the multi-view feature vectors.
In Fig. 3, the multi-view fusion subspace transductive solving unit 306 is connected to the multi-view feature vector similarity computation unit 305; it constructs the multi-view fusion subspace from the multi-view feature vector similarities, solves it transductively, and automatically assigns semantic labels to the short video data to be tagged.
The short video data to be tagged is short video data that carries no semantic label after the preprocessing.
The short video data is multi-source information composed of visual images, audio, content descriptions, and user comments.
Further, the multi-view feature vector similarity computation unit 305 comprises: a module for computing the distance $d_{ij}^k$ between the multi-view feature vectors, a module for computing the per-view mean $\sigma_k$, and a multi-view feature vector similarity module.
The distance computation module computes the distance between the multi-view feature vectors as

$$d_{ij}^k = \sum_{l=1}^{m_k} \left| \frac{x_{i,l}^k}{\lVert x_i^k \rVert_1} - \frac{x_{j,l}^k}{\lVert x_j^k \rVert_1} \right|$$

where $x_i^k$ is the feature vector of the k-th view of the i-th video and $x_{i,l}^k$ is its l-th component, $x_j^k$ is the feature vector of the k-th view of the j-th video and $x_{j,l}^k$ is its l-th component, $m_k$ is the length of the feature vector of the k-th view, and $\lVert x_i^k \rVert_1$, $\lVert x_j^k \rVert_1$ are the 1-norms of $x_i^k$ and $x_j^k$.
The per-view mean computation module computes the mean value $\sigma_k$ of the feature vectors of each view, which serves as the bandwidth of the Gaussian kernel.
The multi-view feature vector similarity module, connected to the distance computation module and to the per-view mean computation module, computes the similarity between the multi-view feature vectors in the interval from 0 to 1 with a Gaussian kernel function; the similarities of the k-th view form the similarity matrix $S^k$ of the k-th view, whose element in row i, column j is

$$S_{ij}^k = \exp\!\left(-\frac{d_{ij}^k}{2\sigma_k^2}\right)$$

where $X=\{x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{N+M}\}$ is the sample set of the short video data, the number of short videos carrying semantic labels is N, the number of short video samples to be tagged is M, the sample set is described from K views, $x_i^k$ is the feature vector of the k-th view of the i-th video, $x_j^k$ is the feature vector of the k-th view of the j-th video, and $1 \le k \le K$, $1 \le i \le N+M$, $1 \le j \le N+M$.
Further, the multi-view fusion subspace transductive solving unit 306 comprises a multi-view fusion subspace construction module and a multi-view fusion subspace solving module.
The multi-view fusion subspace construction module operates as follows:
分别计算第k个视角的相似度矩阵Sk的规范化拉普拉斯相似度矩阵L(Sk),1≤k≤K:Calculate the normalized Laplacian similarity matrix L(S k ) of the similarity matrix S k of the k-th viewing angle, 1≤k≤K:
其中,I是(N+M)×(N+M)阶单位阵,为第k个视角的相似矩阵Sk的度矩阵,即其主对角线元素为Sk的列的和,而非主对角元素为0。Wherein, I is (N+M)×(N+M) order unit matrix, is the degree matrix of the similarity matrix S k of the k-th viewing angle, that is, the main diagonal elements are the sum of the columns of S k , and the non-main diagonal elements are 0.
计算规范化拉普拉斯相似度矩阵L(Sk)的迹归一化矩阵:Compute the trace normalization matrix of the normalized Laplacian similarity matrix L(S k ):
多视角融合子空间相似度矩阵S0的迹归一化处理:Trace normalization processing of multi-view fusion subspace similarity matrix S 0 :
其中,为随机初始化的K个视角的线性组合系数。in, is the linear combination coefficient of K views initialized randomly.
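A minimal sketch of this establishment module follows, assuming the symmetric normalized Laplacian and per-view trace normalization described above; the function name and the dense-matrix implementation are illustrative assumptions.

```python
import numpy as np

def fused_laplacian(S_list, beta):
    # S_list: list of K (N+M, N+M) per-view similarity matrices S_k.
    # beta:   length-K coefficient vector (summing to 1).
    fused = 0.0
    for S_k, b_k in zip(S_list, beta):
        d = S_k.sum(axis=0)                       # degrees = column sums of S_k
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(S_k)) - D_inv_sqrt @ S_k @ D_inv_sqrt
        fused += b_k * (L / np.trace(L))          # trace-normalize, then combine
    return fused
```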
The multi-view fusion subspace solving module, connected to the multi-view fusion subspace establishment module:
Construct the objective function and solve for the predicted label matrix f:

$$\Gamma(f, \beta) = \mu \sum_{k=1}^{K} \beta_k \operatorname{tr}\!\left( f^T \hat{L}(S_k) f \right) + \sum_{i=1}^{N} \| f_i - Y_i \|^2 + \lambda \| \beta \|^2, \qquad \text{s.t. } e^T \beta = 1,$$

where $Y \in \{0,1\}^{(N+M) \times C}$ is the label matrix built from the set of semantic labels: $Y_{ij} = 1$ if sample $x_i$ of the short video data carries the j-th label, and $Y_{ij} = 0$ otherwise; C is the size of the label set; $f \in \mathbb{R}^{(N+M) \times C}$ is the randomly initialized predicted label matrix; and μ and λ are trade-off parameters.
First, fix f and solve for β. Let $f^t$ denote the t-th column of f, and replace $\operatorname{tr}(f^T \hat{L}(S_k) f)$ with $\sum_{t=1}^{C} (f^t)^T \hat{L}(S_k) f^t$. Applying the Lagrange multiplier method to the constraint $e^T \beta = 1$ gives the following formula:

$$\Gamma(\beta) = \mu \, \beta^T g + \lambda \, \beta^T \beta - \delta \left( e^T \beta - 1 \right),$$

where $g = (g_1, \ldots, g_K)^T$ with $g_k = \sum_{t=1}^{C} (f^t)^T \hat{L}(S_k) f^t$.
Taking the partial derivative of Γ with respect to β gives the following formula:

$$\frac{\partial \Gamma}{\partial \beta} = \mu g + H \beta - \delta e.$$
Here $H = 2\lambda I$ is a positive definite matrix, I being the K×K identity matrix. Setting the partial derivative of Γ with respect to β to 0 gives $\beta = H^{-1}(\delta e - \mu g)$; substituting this expression into $e^T \beta = 1$ gives the solution formula for β:

$$\delta = \frac{1 + \mu \, e^T H^{-1} g}{e^T H^{-1} e}, \qquad \beta = H^{-1} \left( \delta e - \mu g \right).$$
Since H is positive definite, $H^{-1}$ is also a positive definite matrix, so $e^T H^{-1} e > 0$ and the division above is well defined.
Then, fix β and solve for f; the specific steps are as follows:
Write $L_0 = \sum_{k=1}^{K} \beta_k \hat{L}(S_k)$ for the fused Laplacian. Taking the partial derivative of Γ with respect to $f_i$ for the labeled samples ($1 \le i \le N$) gives:

$$\frac{\partial \Gamma}{\partial f_i} = 2 \mu \left( L_0 f \right)_i + 2 \left( f_i - Y_i \right).$$

Taking the partial derivative of Γ with respect to $f_i$ for the samples to be labeled ($N+1 \le i \le N+M$) gives:

$$\frac{\partial \Gamma}{\partial f_i} = 2 \mu \left( L_0 f \right)_i.$$
Setting both partial derivatives to 0 and partitioning $f = \begin{bmatrix} f_L \\ f_U \end{bmatrix}$ into its labeled block $f_L$ and unlabeled block $f_U$ yields the closed-form updates

$$f_U = - L_{UU}^{-1} L_{UL} \, f_L, \qquad f_L = \left( I + \mu L_{LL} - \mu L_{LU} L_{UU}^{-1} L_{UL} \right)^{-1} Y_L,$$

where $L_{LL}$, $L_{LU}$, $L_{UL}$, and $L_{UU}$ are the blocks of $L_0$ under the same labeled/unlabeled partition, and $Y_L$ is the labeled block of the label matrix Y.
The steps "fix f, solve β" and "fix β, solve f" are repeated alternately until the predicted label matrix f converges; this yields the linear combination coefficients β of the fusion subspace and the predicted label matrix f, completing the annotation of the short video data to be labeled.
Here $f^t$ denotes the t-th column of f, and δ is the non-negative Lagrange multiplier associated with the constraint $e^T \beta = 1$.
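Putting the two alternating updates together, the solving module might be sketched as below. The reconstructed objective, the parameter names `mu` and `lam` (with H = 2·lam·I), the convergence test, and the absence of a non-negativity projection on β are all illustrative assumptions rather than the patent's exact procedure.

```python
import numpy as np

def transductive_label(S_list, Y, N, mu=1.0, lam=1.0, iters=50, tol=1e-6):
    # S_list: K per-view (N+M, N+M) similarity matrices; Y: (N+M, C) 0/1 label
    # matrix whose unlabeled rows are all zero; N: number of labeled videos.
    K, NM = len(S_list), len(Y)
    L_hat = []
    for S in S_list:                              # trace-normalized Laplacians
        D_is = np.diag(1.0 / np.sqrt(S.sum(axis=0)))
        L = np.eye(NM) - D_is @ S @ D_is
        L_hat.append(L / np.trace(L))
    f = Y.astype(float).copy()
    H_inv = np.eye(K) / (2.0 * lam)               # H = 2*lambda*I (assumption)
    e = np.ones(K)
    beta = np.full(K, 1.0 / K)                    # uniform start; no beta>=0 projection here
    for _ in range(iters):
        # Fix f, solve beta: closed form from the Lagrangian above.
        g = np.array([np.trace(f.T @ L @ f) for L in L_hat])
        delta = (1.0 + mu * e @ H_inv @ g) / (e @ H_inv @ e)
        beta = H_inv @ (delta * e - mu * g)
        # Fix beta, solve f: block solve on the fused Laplacian L_0.
        L0 = sum(b * L for b, L in zip(beta, L_hat))
        Lll, Llu = L0[:N, :N], L0[:N, N:]
        Lul, Luu = L0[N:, :N], L0[N:, N:]
        A = np.linalg.solve(Luu, Lul)             # L_UU^{-1} L_UL
        fL = np.linalg.solve(np.eye(N) + mu * Lll - mu * Llu @ A, Y[:N].astype(float))
        fU = -A @ fL
        f_new = np.vstack([fL, fU])
        if np.abs(f_new - f).max() < tol:         # stop once f has converged
            return f_new, beta
        f = f_new
    return f, beta
```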
Obviously, those skilled in the art should understand that the modules and steps of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; alternatively, they can each be fabricated as an individual integrated circuit module, or multiple of these modules or steps can be fabricated as a single integrated circuit module. As such, the present invention is not limited to any specific combination of hardware and software.
The above-described embodiments merely express implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the patent scope of the present invention. It should be noted that those of ordinary skill in the art can make several modifications, equivalent replacements, improvements, and the like without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent should be determined by the appended claims.
Claims (2)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710051411.7A CN106919652B (en) | 2017-01-20 | 2017-01-20 | Short-sighted frequency automatic marking method and system based on multi-source various visual angles transductive learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106919652A CN106919652A (en) | 2017-07-04 |
| CN106919652B true CN106919652B (en) | 2018-04-06 |
Family
ID=59454027
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710051411.7A Expired - Fee Related CN106919652B (en) | 2017-01-20 | 2017-01-20 | Short-sighted frequency automatic marking method and system based on multi-source various visual angles transductive learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106919652B (en) |
Families Citing this family (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107609570B (en) * | 2017-08-01 | 2020-09-22 | 天津大学 | Micro video popularity prediction method based on attribute classification and multi-view feature fusion |
| CN107610784B (en) * | 2017-09-15 | 2020-10-23 | 中南大学 | Method for predicting relation between microorganisms and diseases |
| CN110209877A (en) * | 2018-02-06 | 2019-09-06 | 上海全土豆文化传播有限公司 | Video analysis method and device |
| CN108897778B (en) * | 2018-06-04 | 2021-12-31 | 创意信息技术股份有限公司 | Image annotation method based on multi-source big data analysis |
| CN109189930B (en) * | 2018-09-01 | 2021-02-23 | 网易(杭州)网络有限公司 | Text feature extraction and extraction model optimization method, medium, device and equipment |
| CN111382620B (en) * | 2018-12-28 | 2023-06-09 | 阿里巴巴集团控股有限公司 | Video tag adding method, computer storage medium and electronic device |
| CN110234038B (en) * | 2019-05-13 | 2020-02-14 | 特斯联(北京)科技有限公司 | User management method based on distributed storage |
| CN110674294A (en) * | 2019-08-29 | 2020-01-10 | 维沃移动通信有限公司 | A kind of similarity determination method and electronic device |
| CN112115299B (en) * | 2020-09-17 | 2024-08-13 | 北京百度网讯科技有限公司 | Video search method, device, recommendation method, electronic device and storage medium |
| CN112800279B (en) * | 2020-12-30 | 2023-04-18 | 中国电子科技集团公司信息科学研究院 | Video-based emergency target information acquisition method, device, equipment and medium |
| CN113033662A (en) * | 2021-03-25 | 2021-06-25 | 北京华宇信息技术有限公司 | Multi-video association method and device |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102508923B (en) * | 2011-11-22 | 2014-06-11 | 北京大学 | Automatic video annotation method based on automatic classification and keyword marking |
| CN102663015B (en) * | 2012-03-21 | 2015-05-06 | 上海大学 | Video semantic labeling method based on characteristics bag models and supervised learning |
| CN103065300B (en) * | 2012-12-24 | 2015-03-25 | 安科智慧城市技术(中国)有限公司 | Method for video labeling and device for video labeling |
| CN105677735B (en) * | 2015-12-30 | 2020-04-21 | 腾讯科技(深圳)有限公司 | Video searching method and device |
| CN106204346A (en) * | 2016-06-30 | 2016-12-07 | 北京文安智能技术股份有限公司 | A kind of movie seat sample automatic marking method based on video analysis, device and electronic equipment |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN106919652B (en) | Short-sighted frequency automatic marking method and system based on multi-source various visual angles transductive learning | |
| CN111259215B (en) | Multimodal subject classification method, device, equipment, and storage medium | |
| US10963504B2 (en) | Zero-shot event detection using semantic embedding | |
| CN103299324B (en) | Using Latent Sub-Tags to Learn Tags for Video Annotation | |
| CN102663015B (en) | Video semantic labeling method based on characteristics bag models and supervised learning | |
| Bhatt et al. | Multimedia data mining: state of the art and challenges | |
| Fu et al. | Advances in deep learning approaches for image tagging | |
| Zhang et al. | A survey on machine learning techniques for auto labeling of video, audio, and text data | |
| Abdar et al. | A review of deep learning for video captioning | |
| Seenivasan | ETL in a World of Unstructured Data: Advanced Techniques for Data Integration | |
| US12288377B2 (en) | Computer-based platforms and methods for efficient AI-based digital video shot indexing | |
| US20250156642A1 (en) | Semantic text segmentation based on topic recognition | |
| Radarapu et al. | Video summarization and captioning using dynamic mode decomposition for surveillance | |
| Yang et al. | Semantic feature mining for video event understanding | |
| CN116975363A (en) | Video tag generation method and device, electronic equipment and storage medium | |
| Sah et al. | Understanding temporal structure for video captioning | |
| Singh et al. | Youtube video summarizer using nlp: A review | |
| Qiu et al. | Liveseg: Unsupervised multimodal temporal segmentation of long livestream videos | |
| Huang et al. | Tag refinement of micro-videos by learning from multiple data sources | |
| Kumar et al. | Video content analysis using deep learning methods | |
| Preethi et al. | Video Captioning using Pre-Trained CNN and LSTM | |
| Tejaswi Nayak et al. | Video retrieval using residual networks | |
| Atencio et al. | Video summarisation by deep visual and categorical diversity | |
| Zhang et al. | Multi-modal tag localization for mobile video search | |
| Kumar et al. | A review of methods for video captioning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20180406; Termination date: 20220120 |