
CN101887439B - Method and device for generating video abstract and image processing system including device - Google Patents


Info

Publication number
CN101887439B
CN101887439B (application CN200910138455.9A)
Authority
CN
China
Prior art keywords
subtitle
video
subtitles
image
video signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200910138455.9A
Other languages
Chinese (zh)
Other versions
CN101887439A (en)
Inventor
白洪亮
孙俊
胜山裕
堀田悦伸
于浩
直井聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN200910138455.9A
Publication of CN101887439A
Application granted
Publication of CN101887439B
Status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Studio Circuits (AREA)

Abstract


Figure 200910138455

A method for generating a video summary of a video signal is provided, including: a video decoding step of decoding the video signal to obtain multiple frames of decoded video images; a subtitle-characteristic obtaining step of obtaining, for each subtitle contained in the decoded video frames, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded frame, the character size of the subtitle, and the recognition code and confidence of the subtitle image after optical character recognition (OCR); and a video summary generation step of processing at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary that matches the content of the video signal. The method of the invention makes it possible to generate a video summary that accurately reflects the main content of the original video file. A device for generating a video summary of a video signal and an image processing system having the device are also provided.


Description

Method and device for generating a video summary, and image processing system including the device

Technical Field

The present invention relates generally to the technical field of video image processing and, more specifically, to a method and device for generating a video summary, and to an image processing system including the device.

Background Art

A video summary (VS, Video Summarization) is a short digest formed for a video file that outlines the file's main content. Two different types of video summary are in common use. The first, called a static video summary (SVS), consists of a series of key frames extracted or synthesized from the original video file. The second, called dynamic video skimming (DVS), is a reduced version of the original video file composed of a group of continuous video clips.

In the above prior-art methods of forming a video summary, shot boundary detection technology is used to determine the key frames of an SVS. A video shot is a series of undivided frames captured by a camera. A scene is defined as a collection of one or more adjacent shots focused on an object or objects of interest. Several kinds of transitions exist between shots, such as cuts, fades, dissolves, and wipes. Algorithms used in shot boundary detection include pixel difference, statistical difference, histogram, motion vector, and so on. In such techniques, frames occurring under certain predefined conditions are designated in advance as key frames; for example, the shots and related frames involved in the above transitions (cuts, fades, dissolves, or wipes) are assumed to reflect important information of the original video file, and these shots or frames are therefore extracted to form the video summary.

However, in most videos the shot-based SVS described above includes too many shots, and the shots involved in these transitions do not necessarily reflect the main content of the video file. Consequently, although the resulting video summary includes many frames, it still fails to convey the important information of the original video file. For example, a story-telling video may contain many shot transitions that merely establish the time and place of events to keep the narrative complete, or that foreshadow the plot; the shots and frames associated with these transitions have little to do with the main content of the story itself. Including all such frames in the summary makes it impossible to obtain an accurate outline of the story from the summary.

The prior-art DVS described above has a similar defect: it is difficult to obtain a video summary that accurately reflects the main content of the original video file.

Summary of the Invention

To overcome the above defects of the prior art, an object of the present invention is to provide a method and device for generating a video summary, and an image processing system including the device, so as to generate a video summary that accurately reflects the main content of the original video file.

According to an embodiment of the present invention, a method for generating a video summary of a video signal is provided, comprising: a video decoding step of decoding the video signal to obtain multiple frames of decoded video images; a subtitle-characteristic obtaining step of obtaining, for each subtitle contained in the decoded video frames, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded frame, the character size of the subtitle, and the recognition code and confidence of the subtitle image after optical character recognition (OCR); and a video summary generation step of processing at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary that matches the content of the video signal.
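The subtitle characteristics enumerated in this claim can be pictured as a simple record type. This is an illustrative sketch only, not the patent's implementation; all field names are assumptions introduced for the example:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SubtitleCharacteristics:
    """One subtitle's characteristics as enumerated in the claim (hypothetical names)."""
    duration_frames: int       # how many consecutive frames the subtitle persists
    position: Tuple[int, int]  # (x, y) of the subtitle region in the decoded frame
    char_size: int             # character height in pixels
    ocr_text: str              # recognition code produced by OCR
    ocr_confidence: float      # OCR confidence in [0.0, 1.0]

# Example: a bottom-of-frame dialogue caption lasting 75 frames
caption = SubtitleCharacteristics(
    duration_frames=75, position=(80, 420), char_size=24,
    ocr_text="Welcome to the show", ocr_confidence=0.93,
)
print(caption.ocr_text, caption.duration_frames)
```

Any subset of these fields may be populated, matching the claim's "at least one of the following characteristics" wording.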

According to an embodiment of the present invention, a device for generating a video summary of a video signal is also provided, comprising: a video decoding unit configured to decode the video signal to obtain multiple frames of decoded video images; a subtitle-characteristic obtaining unit configured to obtain, for each subtitle contained in the decoded video frames, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded frame, the character size of the subtitle, and the recognition code and confidence of the subtitle image after optical character recognition (OCR); and a video summary generation unit configured to process at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary that matches the content of the video signal.

Other embodiments of the present invention further provide a video image processing system having the above device for generating a video summary of a video signal according to the present invention. Such a video image processing system is, for example, a TV capture card, a DVD player, or a laptop computer.

In addition, other embodiments of the present invention provide a program product storing machine-readable instruction code which, when read and executed by a machine, carries out the above method for generating a video summary of a video signal according to the present invention.

As described above, prior-art methods for obtaining a video summary do not consider the relationship between the subtitle information present in a video file and the content of that file, whereas the method of the present invention exploits exactly this relationship to generate the video summary. This improves the correlation between the obtained summary and the content of the original video file, so that the summary accurately reflects the file's main content.

Brief Description of the Drawings

The above and other objects, features, and advantages of the present invention will be more easily understood with reference to the following description of its embodiments in conjunction with the accompanying drawings. The components in the figures are not drawn to scale and merely illustrate the principles of the invention. Throughout the drawings, the same or similar technical features or components are denoted by the same or similar reference numerals.

Fig. 1 is a simplified flowchart of a method for generating a video summary according to an embodiment of the present invention;

Fig. 2A is a schematic diagram of one example of an original video file involved in generating a video summary by a method according to an embodiment of the present invention;

Fig. 2B is a schematic diagram of the video summary obtained from the original video file of Fig. 2A by a method according to an embodiment of the present invention;

Fig. 3A is a schematic diagram of another example of an original video file involved in generating a video summary by a method according to an embodiment of the present invention;

Fig. 3B is a schematic diagram of the video summary obtained from the original video file of Fig. 3A by a method according to an embodiment of the present invention;

Fig. 4 is a simplified flowchart of one example of a method for obtaining the character size of a subtitle in the process of generating a video summary by a method according to an embodiment of the present invention;

Fig. 5 is a schematic block diagram of a device for generating a video summary according to an embodiment of the present invention.

Detailed Description of the Embodiments

Embodiments of the present invention are described below with reference to the drawings. Elements and features described in one drawing or embodiment of the present invention may be combined with elements and features shown in one or more other drawings or embodiments. Note that, for clarity, representations and descriptions of components and processes that are unrelated to the present invention and known to those of ordinary skill in the art are omitted from the drawings and descriptions.

Fig. 1 is a simplified flowchart of a method for generating a video summary according to an embodiment of the present invention. As shown in the figure, the method 100 for generating a video summary according to this embodiment starts at step S110. In video decoding step S120, the video signal included in the original video file is decoded to obtain multiple frames of decoded video images. In subtitle-characteristic obtaining step S130, at least one of the following characteristics is obtained for each subtitle contained in the decoded video frames: the duration of the subtitle, the position of the subtitle in the corresponding decoded frame, the character size of the subtitle, and the recognition code and confidence of the subtitle image after optical character recognition (OCR). In video summary generation step S140, at least one of the obtained subtitle characteristics is processed according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary that matches the main content of the video signal (i.e., the content of the original video file).
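Steps S120 through S140 can be sketched as a small pipeline. This is a toy illustration under invented assumptions (the "video signal" is already a list of frame records, and a caller-supplied predicate stands in for the learned content-subtitle relationship); the patent does not prescribe any particular implementation:

```python
def decode(video_signal):
    # S120: stand-in decoder -- here the "signal" is already a list of frame
    # records, each with its subtitles (a real system would use e.g. ffdshow)
    return list(video_signal)

def obtain_subtitle_characteristics(frames):
    # S130: collect every subtitle together with the frame it appears in
    return [(frame, sub) for frame in frames for sub in frame["subtitles"]]

def generate_video_summary(video_signal, is_representative):
    # S140: keep the frames whose subtitles the learned or supplied
    # content-subtitle relationship deems representative
    frames = decode(video_signal)
    pairs = obtain_subtitle_characteristics(frames)
    return [frame["id"] for frame, sub in pairs if is_representative(sub)]

# Toy signal: three frames; only frame 2 carries a large-character subtitle
signal = [
    {"id": 1, "subtitles": [{"char_size": 12}]},
    {"id": 2, "subtitles": [{"char_size": 30}]},
    {"id": 3, "subtitles": []},
]
summary = generate_video_summary(signal, lambda sub: sub["char_size"] >= 24)
print(summary)  # frame ids selected for the summary
```

Here the predicate encodes one possible relationship ("large characters mark main content"); in the method it is the relationship learned or supplied in advance that drives this selection.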

The decoding performed on the video information in video decoding step S120 can be accomplished by various prior-art video decoding methods. Video decoding is a mature image-processing technology, and its details are not repeated here. As an example, the video information may be decoded by the decoder ffdshow; information about ffdshow is available at http://sourceforge.net/projects/ffdshow.

In subtitle-characteristic obtaining step S130, the subtitles in the decoded video frames obtained in step S120 are processed accordingly. In general, subtitle information is added to the video images after they have been captured by the camera, and it usually reflects information about the frame in which it appears, for example the subject of the frame or the content the frame expresses. Subtitles may include, for example, the time, place, and parties of an event, the score of a sports game, a weather forecast, the price of a commodity, and so on.

The various characteristics of the subtitles are obtained by detecting, tracking, and recognizing all subtitles contained in the decoded video frames. The duration of a subtitle can be determined by various prior-art techniques for determining when a subtitle appears, for example: the method of tracking text in video using a signature-based algorithm disclosed in U.S. Patent No. 6,470,094 to Rainer Wolfgang Lienhart and Axel Wernicke, entitled "Generalized text localization in images", and U.S. Patent No. 6,473,522 to the same inventors, entitled "Estimating text color and segmentation of images"; the method of determining the start and end times of subtitles in music videos using subtitle position information disclosed in U.S. Patent Application Publication No. 2004/0170392 to Lu Lie, Sun Yan-Feng, Li Mingjing, Hua Xian-Sheng, and Zhang Hong-Jiang, entitled "Automatic detection and segmentation of music videos in an audio/video stream"; the method of tracking subtitles using similar color and position information as features disclosed in U.S. Patent Application Publication No. 2007/0038612 to Sanghoon Sull, Hyeokman Kim, Min Chung, Sangwook Lee, and Sangwook Oh, entitled "System and method for indexing, searching, identifying, and editing multimedia files"; the method of tracking text regions using an image-matching algorithm based on SSD (Sum of Square Difference) disclosed in Huiping Li et al., "Text enhancement in digital video using multiple frame integration", ACM Multimedia (pp. 19-22, 1999); the method of detecting the positions of frames where subtitles change using QSDD (Quantized Spatial Difference Density) disclosed in Xiaoou Tang et al., "A spatial-temporal approach for video caption detection and recognition", IEEE Transactions on Neural Networks (Vol. 13, No. 4, pp. 961-971, 2002); and the method of determining the relationship between consecutive frames using higher-level features such as the codes and confidences of characters produced by a recognition engine, disclosed in Takeshi Mita et al., "Improvement of Video Recognition by Character Selection", ICDAR (pp. 1089-1093, 2001), and in Japanese Patent Application Publication JP 2001-285716 to 三田雄志 et al., entitled "テロツプ情報処理装置及びテロツプ情報表示装置". In addition, the duration of a subtitle can also be determined by the method disclosed in Chinese Patent Application No. 200810074125.3 to Bai Hongliang et al., entitled "Apparatus and method for determining subtitle existence time".
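The duration-determination idea common to the tracking methods cited above can be shown in miniature: a subtitle's duration is the length of the run of consecutive frames in which the same (or near-identical) text region persists. A toy sketch, assuming per-frame recognized text is already available and that exact text equality stands in for the cited similarity measures:

```python
def subtitle_durations(per_frame_text, fps=25.0):
    """Group consecutive frames carrying the same subtitle text and
    report each subtitle's duration in seconds (toy illustration)."""
    durations = []
    current, count = None, 0
    for text in per_frame_text + [None]:  # sentinel flushes the final run
        if text == current:
            count += 1
        else:
            if current is not None:
                durations.append((current, count / fps))
            current, count = text, 1
    return durations

# "Hello" persists for 50 frames (2 s at 25 fps), "World" for 25 frames (1 s)
frames = ["Hello"] * 50 + ["World"] * 25
print(subtitle_durations(frames))
```

Real trackers compare image regions (signatures, SSD, color and position) rather than OCR output, but the run-length bookkeeping is the same.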

The position of a subtitle in the corresponding decoded frame can likewise be determined by various prior-art methods for extracting subtitle regions, for example: the methods of extracting subtitle regions based on different features, such as connected-component features and edge features, disclosed in Japanese Patent Publication JP 2006-53802 to 伊藤清美 and 新倉康巨, entitled "映像種別判定方法、映像種別判定装置及び映像種別判定プログラム", in Japanese Patent Application Publication JP H9-16769 to 砂川伸一 and 松林一弘, entitled "画像処理装置および方法", and in Rainer Lienhart et al., "Localizing and Segmenting Text in Images and Videos", IEEE Transactions on Circuits and Systems for Video Technology (Vol. 12, No. 4, pp. 256-268, 2002); the method of extracting subtitle regions based on texture features disclosed in Yu Zhong et al., "Automatic Caption Localization in Compressed Video", IEEE Transactions on Pattern Analysis and Machine Intelligence (Vol. 22, No. 4, pp. 385-392, 2000); and the methods of recognizing subtitle regions disclosed in Xiaoou Tang et al., "A Spatial-Temporal Approach for Video Caption Detection and Recognition", IEEE Transactions on Neural Networks (Vol. 13, No. 4, pp. 961-971, 2002), and in Toshio Sato et al., "Video OCR for Digital News Archive", Workshop on Content-Based Access of Image and Video Databases (pp. 52-60, 1998). In addition, the position of a subtitle in the corresponding decoded frame can also be determined by the method disclosed in Chinese Patent Application No. 200710140327.9 to Bai Hongliang et al., entitled "Apparatus and method for extracting subtitle region".

The character size of a subtitle can be determined using various prior-art methods, for example: the method disclosed in Lyu, M.R., Jiqiang Song, and Min Cai, "A comprehensive method for multilingual video text detection, localization, and extraction", IEEE Transactions on Circuits and Systems for Video Technology (15(2), 243-255, 2005), in which Fig. 7 illustrates a projection-based approach; and the method disclosed in Xiaoou Tang, Xinbo Gao, Jianzhuang Liu, and Hongjiang Zhang, "A spatial-temporal approach for video caption detection and recognition", IEEE Transactions on Neural Networks (13(4), 2002, 961-971), in which Fig. 9 illustrates a projection-based approach.
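The projection-based idea referenced in both papers can be illustrated: summing text pixels row by row over a binarized subtitle region yields a horizontal projection profile, and the vertical span of non-empty rows approximates the character height. A simplified sketch under the assumption that the subtitle region has already been binarized (text pixels = 1, background = 0):

```python
def char_height_by_projection(region):
    """Estimate character height from a binarized subtitle region
    via its horizontal (row-wise) projection profile."""
    profile = [sum(row) for row in region]           # text pixels per row
    text_rows = [i for i, s in enumerate(profile) if s > 0]
    if not text_rows:
        return 0                                     # no text in the region
    return text_rows[-1] - text_rows[0] + 1          # span of non-empty rows

# 6-row region whose text occupies rows 1..4 -> estimated height 4
region = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(char_height_by_projection(region))  # 4
```

The cited methods additionally segment the profile into individual text lines and characters; this sketch keeps only the core row-projection measurement.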

As for the recognition code and confidence obtained by applying optical character recognition (OCR) to the subtitle image, OCR is a very mature image-recognition technology, so its details are not repeated here.

In video summary generation step S140, a video summary matching the content of the original video file is generated according to the relationship between the content of the video signal and the characteristics of the subtitles. "Matching" here means that the video summary accurately summarizes the main content of the original video file (represented by the video signal).

As noted above, the subtitle information in a decoded video frame usually reflects information about that frame, for example the subject of the frame or the content it expresses. Therefore, if the video summary is generated based on the relationship between the subtitle information and the video signal that contains it, a summary that accurately reflects the main content of the video signal can be produced.

The relationship between the content of the video signal and the characteristics of the subtitles can be acquired in various ways. For example, knowledge of this relationship can be obtained by learning in advance from a certain number of subtitled video images of various types; when the summary is actually generated, the relationship is determined from this previously learned knowledge, and the obtained subtitle characteristics are then processed accordingly to produce a summary that matches the content of the original video file. Alternatively, information about the relationship can be obtained beforehand and then supplied as an input together with the video signal, so that a summary accurately reflecting the important content of the video signal is generated based on this relationship.

Learning from a certain number of video images of various types to obtain relevant information is a common practice in image processing. For example, the learning process may yield information such as: the larger the size of a subtitle in a decoded frame, the higher the probability that it represents the main content of the video signal; or a subtitle at a particular position in a decoded frame is more likely to represent the main content. Such learned information reflects the relationship between the content of the video signal and the subtitles. Of course, these examples are merely illustrative; depending on the specific situation, various relationships between the content of the video signal and the characteristics of the subtitles are possible, as described in detail later.
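The kind of learned knowledge described here (larger characters and certain positions are more likely to mark main content) can be folded into a simple weighted score used to rank subtitles. A hypothetical illustration only; the features, weights, and saturation thresholds are invented for the example and are not taken from the patent:

```python
def subtitle_score(char_size, duration_s, ocr_confidence, at_key_position,
                   w_size=0.4, w_duration=0.3, w_conf=0.2, w_pos=0.1):
    """Weighted relevance score in [0, 1] (all weights hypothetical)."""
    size_term = min(char_size / 40.0, 1.0)   # saturate at 40-px characters
    dur_term = min(duration_s / 10.0, 1.0)   # saturate at 10 s on screen
    pos_term = 1.0 if at_key_position else 0.0
    return (w_size * size_term + w_duration * dur_term
            + w_conf * ocr_confidence + w_pos * pos_term)

# A large, long-lived, confidently recognized topic caption scores high...
high = subtitle_score(char_size=36, duration_s=12, ocr_confidence=0.95,
                      at_key_position=True)
# ...while a small, fleeting caption scores low.
low = subtitle_score(char_size=10, duration_s=1, ocr_confidence=0.60,
                     at_key_position=False)
print(round(high, 3), round(low, 3))
```

In a learned system the weights would come from training data rather than being fixed by hand; the frames carrying the highest-scoring subtitles would then be taken into the summary.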

In addition, as is well known, subtitles are information attached after a video file has been produced, and one of their main purposes is to help the viewer understand the file's main content. The relationship between the content of the video signal and the subtitles is therefore often already fixed to some extent when the subtitles are added. Accordingly, information about this relationship can also be obtained before the summary-generation process, for example through operator judgment or targeted analysis of the subtitled video signal, and then supplied as an input parameter to the summary-generation process.

It can thus be seen that some specific relationship always exists between the content of the video signal and the characteristics of the subtitles. The video summary generation method according to the embodiments of the present invention exploits this relationship, processing at least one of the obtained subtitle characteristics accordingly to generate a summary that accurately reflects the main content of the video signal. The method according to the present invention can therefore be called a "subtitle-based" video summary extraction method.

As mentioned above, the relationship between the content of the video signal and the subtitles can, for example, be obtained through prior learning and real-time judgment, or it can be determined in advance, with the information about the relationship then supplied, together with the video signal whose summary is to be generated, as an input parameter of the video summary generation method according to the present invention. Of course, besides the methods listed above, any other suitable method for obtaining the relationship between the content of the video signal and the subtitles may also be used; these are not described one by one here.

In video summary generation step S140, at least one of the obtained characteristics of the subtitles is processed according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary adapted to the content of the video signal (i.e., the content of the original video file). As noted above, depending on the content of the video signal, this relationship can take various forms. To aid understanding of the video summary generation method according to this embodiment of the present invention, several specific examples are described in detail below with reference to FIGS. 2A-2B and 3A-3B.
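As a concrete illustration of the subtitle characteristics that step S140 operates on, the record kept for each subtitle can be sketched as a small data structure (a hypothetical sketch; the field names are illustrative and not part of the claimed method):

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    """Characteristics kept for one subtitle found in the decoded frames."""
    text: str           # identification code from OCR
    confidence: float   # OCR confidence, e.g. in [0, 1]
    position: tuple     # (x, y) location in the decoded image
    char_size: int      # character size in pixels
    start_time: float   # time of the first frame containing the subtitle
    end_time: float     # time of the last frame containing the subtitle

    @property
    def duration(self) -> float:
        """Duration of the subtitle, one of the processed characteristics."""
        return self.end_time - self.start_time
```

A summary generator would then filter and group such records according to the content-subtitle relationship of the signal at hand.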

FIG. 2A is a schematic diagram showing an example of an original video file involved in generating a video summary by the method according to an embodiment of the present invention. As shown in FIG. 2A, assume that the content of the video signal whose summary is to be extracted relates to a talk-type or news-type program, such as an interview; in this example a talk-type program is used. FIG. 2A shows multiple decoded image frames NO.1, ..., NO.6, ... obtained by decoding the video signal. Taking image frame NO.1 as an example: subtitle A in its upper left corner represents the topic of the talk show; subtitle B1 at its middle left and subtitle C1 at its middle right represent the names, titles, and the like of the participants; subtitle D1 at its bottom represents the talk content of participant B1 or C1; and the time t1 in parentheses indicates the time point of image frame NO.1. The other decoded image frames have the same structure as image frame NO.1, and their details are not described one by one.

It is easy to understand that although only six decoded frames are shown in FIG. 2A, this is merely illustrative; in fact, depending on the size of the video signal and the decoding method used, any number of decoded frames may be obtained, which is indicated by the ellipses in the figure. Moreover, although every frame shown in this example contains subtitles, in practice not every frame must contain them. Neither circumstance affects the implementation of the video summary generation method according to the embodiment of the present invention. For example, for a frame that contains no subtitles, one may choose to skip subtitle extraction and related processing when generating the summary, or to handle the frame with a traditional video summary generation method.

From the multiple decoded frames in FIG. 2A it can be seen that: over the entire duration of the talk-show video signal, subtitle A has the longest duration (it may also appear in all or part of other decoded frames not shown in the figure); subtitle B1, representing the first talk participant, appears in decoded frames NO.1-NO.6, and its duration (t6-t1) is shorter than that of subtitle A; subtitle C1, representing the second talk participant, appears in frames NO.1 to NO.3 with duration (t3-t1), and from frame NO.4 onward it is replaced by subtitle C2, representing the third talk participant, whose duration (t6-t4) is also shorter than that of subtitle A; and the durations of subtitles D1-D6, representing the talk content, are shorter than those of subtitles A, B1, C1, and C2.
It can be seen that what may be called a "cascaded" relationship exists between the video signal and the characteristics of the subtitles: subtitle A, which has the longest duration, is located in the upper left part of the decoded frames, and represents the topic of the talk show, is considered to carry the most important information; subtitles B1, C1, and C2, which have shorter durations, are located at the middle left or middle right of the decoded frames, and represent the participants, are considered to carry information of secondary importance; and subtitles D1-D6, which have the shortest durations, are located in the lower part of the decoded frames, and represent the specific talk content of each participant, are considered to carry the least important information. Because the topic and participants of the talk show are important information, they should be included in the video summary generated for its video signal, whereas the specific content of each participant's talk is not the focus of the summary and need not necessarily be included. Subtitle A, subtitles (B1, C1, C2), and subtitles (D1-D6) therefore relate to the "cascaded" main content of the talk show. Furthermore, if the specific talk content of a participant is wanted, the start and end times of that participant's presence in the video signal can be retrieved directly from the video summary, and the relevant part of the original video can then be watched (described in detail below).

Thus, based on the cascaded relationship described above between the content of the video signal and the characteristics of the subtitles, a "cascaded" video summary of the talk-type video signal can be generated by processing characteristics of the subtitles obtained from the decoded images, such as each subtitle's duration, its position in the corresponding decoded image, and its identification code and confidence after optical character recognition (OCR). The process of generating such a cascaded video summary is described in detail below with reference to FIG. 2B.

By processing the above characteristics of the obtained subtitles, subtitle A, which is located in the upper left part of the decoded images (or, depending on the case, the upper right part) and has the longest duration (the first duration), is determined as the first-level subtitle. This first-level subtitle represents the topic of the talk-show video signal and is the most important information.

Among all decoded images containing the first-level subtitle A, the subtitles located at the middle left or middle right of the decoded images and having durations (second durations) shorter than the first duration, namely subtitle B1 with duration (t6-t1), subtitle C1 with duration (t3-t1), and subtitle C2 with duration (t6-t4), are taken as the second-level subtitles. Furthermore, according to the surnames, titles (e.g., positions), or forms of address expressed by the identification codes and confidences obtained by OCR of these second-level subtitles, they can be divided into the distinct sub-subtitles B1, C1, and C2, which identify the different participants of the talk show during the duration of subtitle A; that is, they can be regarded as identifiers of the different talk participants.

Among all decoded images containing the first-level subtitle A and the second-level subtitles B1, C1, and C2, the subtitles located in the lower part of the decoded images and having a duration (a third duration) shorter than the second duration are determined as the third-level subtitles. Since third-level subtitles D1-D3 coexist with second-level sub-subtitles B1 and C1, subtitles D1-D3 are determined to be the talk content corresponding to the participants represented by B1 and C1. Similarly, subtitles D4-D6 are determined to be the talk content corresponding to the participants represented by B1 and C2.
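The three-level determination described for this example can be sketched as follows (an illustrative sketch only: the vertical-thirds position test and the duration ordering are assumptions drawn from the talk-show layout of FIG. 2A):

```python
from collections import namedtuple

Sub = namedtuple("Sub", "text position duration")  # position = (x, y)


def classify_cascaded(subtitles, frame_h):
    """Split subtitles into three levels by vertical position and duration.

    Level 1: top of the frame, longest duration  -> topic (subtitle A).
    Level 2: middle of the frame                 -> participants (B1, C1, C2).
    Level 3: bottom of the frame                 -> talk content (D1-D6).
    """
    def band(s):
        y = s.position[1]
        if y < frame_h / 3:
            return "top"
        return "middle" if y < 2 * frame_h / 3 else "bottom"

    top = [s for s in subtitles if band(s) == "top"]
    level1 = max(top, key=lambda s: s.duration) if top else None
    level2 = [s for s in subtitles if band(s) == "middle"
              and (level1 is None or s.duration < level1.duration)]
    level3 = [s for s in subtitles if band(s) == "bottom"]
    return level1, level2, level3
```

Sub-subtitle grouping within level 2 (by OCR-recognized name or title) and the pairing of level-3 content with level-2 participants would follow as further passes over the returned lists.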

After the above processing of the corresponding subtitle characteristics, at least one decoded image frame containing the cascaded subtitles, i.e., the first-level, second-level, and third-level subtitles, can be selected as the cascaded video summary of the talk-show video signal. FIG. 2B is a schematic diagram of such a cascaded video summary. As shown in FIG. 2B, the decoded frames NO.1, NO.3, NO.4, and NO.6, which represent the different talk participants, are selected to constitute the part of the summary relating to the participants represented by subtitles B1, C1, and C2. Although FIG. 2B selects the decoded frames NO.1, NO.3, NO.4, and NO.6 corresponding to the start and end time points at which the participants represented by subtitles C1 and C2 join and leave the talk show, in fact, since the specific talk content of the different speakers is not the focus of the generated summary, the summary may include only the decoded frame corresponding to one of the time points at which each participant joins or leaves.
In that case, if the specific talk content of a certain participant is wanted, for example the participant represented by subtitle C1, the decoded frame related to that participant (frame NO.1 or NO.3 in FIG. 2B) can be retrieved from the video summary, the time point at which the participant joined and/or left the show (t1 and/or t3) can be read from that frame, and this time point can then be used as an index into the original talk-show video signal to locate the specific talk content of the participant represented by subtitle C1, for viewing or other processing. Of course, besides the decoded frames corresponding to the joining and leaving time points of the different participants, further decoded frames related to the corresponding participants may also be selected. This selection may be arbitrary or based on some criterion and can be configured according to actual needs; the details are not repeated here.

Optionally, any one frame (image frame NO.1 at the root node of the tree in FIG. 2B) may be selected from the video summary to represent the talk show. This frame can serve as an index to the video summary, for example when searching for the corresponding summary among multiple generated summaries.

As the above description shows, the "cascaded" video summary formed according to this example of the present invention exhibits a cascade of importance levels among the subtitles contained in the decoded frames: subtitle A, subtitles (B1, C1, C2), and subtitles (D1-D6) are ranked in descending order of importance in reflecting the main content of the video signal, constituting what may be called "cascaded" subtitles. Of course, depending on the specific video signal and subtitles, the importance of such cascaded subtitles may instead be ranked in ascending order. It is easy to understand that although FIGS. 2A-2B involve only three different talk participants, any number of participants is possible. If there is only one participant (e.g., a news broadcast), at least one decoded frame can be chosen arbitrarily to form the summary. This choice may also follow some criterion; for example, one frame may be selected at fixed time intervals. Such settings can be made according to actual needs and are not detailed here. If there are more than three participants, the cascaded video summary of FIG. 2B may have more branches representing the different participants (shown as child nodes in the tree of FIG. 2B).

In addition, although the cascaded subtitles in the specific example of FIGS. 2A-2B have a three-level structure, depending on the relationship between the content of the subtitles and the content of the video signal whose summary is to be generated, cascaded subtitles with more or fewer levels may be formed in the summary as needed. That is, there may be N levels of "cascaded" subtitles, where N is a positive integer. For example, if the talk of different speakers must be distinguished, information from other subtitle characteristics such as color and size can be used. Subtitles distinguished in this way, identifying the talk content of different participants by characteristics such as color and size, can be regarded as fourth-level subtitles of the cascade. For example, in the video summary of FIG. 2B, further decoded frames may branch off below decoded frame NO.1 at a child-node position, to distinguish the decoded images containing the talk content of the participants represented by subtitles B1 and C1, respectively.

Another specific example, in which a video summary is generated by processing other subtitle characteristics, is described below with reference to FIGS. 3A-3B.

FIG. 3A is a schematic diagram showing another example of an original video signal involved in generating a video summary by the method according to an embodiment of the present invention. In some video signals, the following relationship exists between the subtitles in the video images and the content of the video signal: the subtitles are located at a fixed position in the decoded images and relate to the main content of the signal. In this case, the subtitles at the fixed position are compared with one another, the subtitles at the time points where the subtitle image changes are determined, and the video summary of the signal is composed of the decoded images containing the determined subtitles.

For example, assume that the content of the video signal whose summary is to be generated relates to a sports match. FIG. 3A shows multiple decoded video frames NO.1, ..., NO.6, ... obtained by decoding the video signal. It is easy to understand that although only six decoded frames are shown, this is merely illustrative; depending on the size of the video signal and the decoding method used, any number of decoded frames may be obtained. Taking frame NO.1 as an example, subtitle S1 at the fixed position, namely the lower right, is a scoreboard displaying the score of the match. A scoreboard usually includes information such as the type of match, the contestants, and the current score. The other decoded frames have the same structure as frame NO.1, and their details are not described one by one. Among the decoded frames shown in FIG. 3A, the scores in frames NO.1-NO.2 correspond to subtitle S1, those in frames NO.3-NO.5 correspond to subtitle S2, and that in frame NO.6 corresponds to subtitle S3.

During a match, the appearance of the scoreboard and its position in each decoded frame generally remain unchanged, while the information on how the score changes reflects the main content of a match-type video signal. Therefore, when generating the video summary, all scoreboard subtitles located at the fixed position are extracted and compared. The scoreboard subtitles S1, S2, and S3 at the time points where the scoreboard image changes are determined, and the video summary of the signal is composed of the decoded frames NO.1 (or NO.2), NO.3 (or NO.5), and NO.6 containing the determined scoreboard subtitles, as shown in FIG. 3B.
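A minimal sketch of this fixed-position comparison (assuming decoded frames are available as 2-D grayscale NumPy arrays and the scoreboard region is known in advance; the change threshold is an illustrative choice):

```python
import numpy as np


def scoreboard_keyframes(frames, region, diff_thresh=10.0):
    """Return indices of frames where the fixed-position subtitle changes.

    frames      -- iterable of 2-D grayscale arrays (decoded images)
    region      -- (y0, y1, x0, x1) bounds of the scoreboard area
    diff_thresh -- mean absolute pixel difference treated as "changed"
    """
    y0, y1, x0, x1 = region
    keyframes, prev = [], None
    for i, frame in enumerate(frames):
        patch = frame[y0:y1, x0:x1].astype(np.float32)
        if prev is None or np.mean(np.abs(patch - prev)) > diff_thresh:
            keyframes.append(i)   # an S1 -> S2 -> S3 transition point
        prev = patch
    return keyframes
```

As the text notes, the same selection could instead be driven by comparing OCR results of the extracted scoreboard subtitles rather than raw pixels.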

It is easy to understand that although the scoreboard subtitle in FIGS. 3A-3B is located at the lower right corner of each decoded frame, the scoreboard may be located at any other suitable position depending on the actual situation. In addition, although in this example the decoded frames at which the score changes are found by extracting the subtitle at the fixed position and comparing the subtitle images, it is equally possible to perform OCR on the extracted scoreboard subtitles and determine the score changes by comparing the OCR results, in order to generate the summary.

FIGS. 2A-2B and 3A-3B above describe two specific examples of generating a video summary by processing various subtitle characteristics based on the relationship between the content of the video signal and the characteristics of the subtitles. Of course, which subtitle characteristics to process can also be decided according to other relationships between the subtitle characteristics and the content of the video signal whose summary is to be generated. Some specific examples follow.

In one specific example, the following relationship exists between the content of the video signal and the characteristics of the subtitles: subtitles with larger character sizes relate to more important content of the signal. In many video images it is common to indicate the importance of a subtitle's content through its size. In this case, the average character size of the portion of subtitles with the longest durations (for example, the top K%) among all subtitles can be computed. The subtitles whose character size exceeds this average are determined, and the video summary of the signal is composed of the decoded images containing the determined subtitles.
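This selection rule can be sketched as follows (assuming subtitle records carrying `char_size` and `duration`; the value of K is a tunable parameter, not fixed by the text):

```python
from collections import namedtuple

Rec = namedtuple("Rec", "char_size duration")


def select_larger_than_average(subtitles, k_percent=20):
    """Keep subtitles whose character size exceeds the average character
    size of the longest-lasting K% of all subtitles."""
    if not subtitles:
        return []
    by_duration = sorted(subtitles, key=lambda s: s.duration, reverse=True)
    n = max(1, len(by_duration) * k_percent // 100)
    avg_size = sum(s.char_size for s in by_duration[:n]) / n
    return [s for s in subtitles if s.char_size > avg_size]
```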

In another specific example, the relationship is: subtitles with larger character sizes and longer durations relate to more important content of the video signal. In this case, a size threshold H for the character size and a time threshold T for the duration of the subtitles can be preset; the subtitles whose character size is greater than H and whose duration is greater than T are determined, and the video summary of the signal is composed of the decoded images containing the determined subtitles.

In yet another specific example, the relationship is: subtitles with longer durations that are located at a specific position in the decoded images relate to more important content of the video signal. In this case, a time threshold T for the duration is preset; the subtitles whose duration exceeds T and which are located at the specific position in the decoded images are determined, and the video summary of the signal is composed of the decoded images containing the determined subtitles.

In a further specific example, the relationship is: subtitles with larger character sizes that are located at a specific position in the decoded images relate to more important content of the video signal. In this case, a size threshold H for the character size is preset; the subtitles whose character size exceeds H and which are located at the specific position in the decoded images are determined, and the video summary of the signal is composed of the decoded images containing the determined subtitles.

In another specific example, the relationship is: subtitles located at a specific position in the decoded images and having high OCR confidence relate to more important content of the video signal. In this case, a threshold L for the OCR confidence of the subtitles can be preset; the subtitles located at the specific position in the decoded images and whose OCR confidence exceeds L are determined, and the video summary of the signal is composed of the decoded images containing the determined subtitles.

In another specific example, the relationship is: subtitles with larger character sizes and high OCR confidence relate to more important content of the video signal. In this case, a size threshold H for the character size and a threshold L for the OCR confidence are preset; the subtitles whose character size exceeds H and whose OCR confidence exceeds L are determined, and the video summary of the signal is composed of the decoded images containing the determined subtitles.
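All of the threshold-based variants above follow one pattern: fix a predicate over subtitle characteristics and keep the frames whose subtitles satisfy it. A hedged sketch combining them (the names H, T, L follow the text; merging them into one function is an illustrative choice):

```python
from collections import namedtuple

Feat = namedtuple("Feat", "char_size duration confidence position")


def select_subtitles(subtitles, size_H=None, duration_T=None,
                     conf_L=None, in_region=None):
    """Keep subtitles passing every threshold that is specified.

    size_H     -- minimum character size H
    duration_T -- minimum duration T
    conf_L     -- minimum OCR confidence L
    in_region  -- optional callable testing the (x, y) position
    """
    def ok(s):
        return ((size_H is None or s.char_size > size_H) and
                (duration_T is None or s.duration > duration_T) and
                (conf_L is None or s.confidence > conf_L) and
                (in_region is None or in_region(s.position)))
    return [s for s in subtitles if ok(s)]
```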

Although several specific examples of generating a video summary by processing subtitle characteristics are listed above, those skilled in the art will understand that, since the relationships between video signals and subtitles vary widely in practice, the subtitle characteristics may likewise be processed in whatever appropriate manner corresponds to the particular relationship. This can be configured by those skilled in the art on a case-by-case basis and is not detailed further here. For example, the OCR results of all subtitles can be examined so that, based on the specific content of the subtitles, it is decided which subtitles, and hence which image frames, to select for generating the summary.

It is easy to understand that, since subtitles are closely tied to the content of the video signal, generating a summary by processing the characteristics of all subtitles contained in the decoded images can accurately reflect the signal's main content. In a sense, the subtitle-based video summary generation method according to the present invention may also be called a "content-based" method, because it rests on the relationship between the content of the video signal and the characteristics of the subtitles, which reflect that content to some degree. Notably, no judgment or processing of the content of the decoded video images themselves is required.

The subtitle characteristics mentioned above include the character size of the subtitle. As noted, the size of subtitle characters can be obtained by various existing methods. As a supplement and improvement to the prior art, a preferred method for obtaining the size of subtitle characters is described below with reference to FIG. 4.

FIG. 4 is a simplified flowchart showing an example of a method 400 for obtaining the size of subtitle characters in the process of generating a video summary by the method according to an embodiment of the present invention. In this example, the subtitle character size is the height of the subtitle characters. As shown in FIG. 4, in step S405 the subtitle image is processed to produce a binarized image of the subtitle and its image features, where the image height is H and the width is W; for example, stroke pixels may be given the value 255 and non-stroke pixels the value 0. Various known image processing methods can be used to obtain the subtitle image features; for example, the method disclosed in Canny, J., "A Computational Approach to Edge Detection", IEEE Trans. Pattern Analysis and Machine Intelligence, 8:679-714, 1986, can be used for this feature extraction.

In step S410, it is determined whether the height H of the subtitle image is less than or equal to twice the predefined character height MIN_CS. If the result is "yes", the subtitle is a single line, and in step S415 the image height H is taken as the character height of the subtitle. Otherwise, the flow proceeds to step S420.

In step S420, a projection is applied to the subtitle image. Specifically, for a horizontal subtitle a horizontal projection is performed: the stroke pixel values of the subtitle image are accumulated in the horizontal direction and the accumulated values are stored in Hist(i), where i is an index value representing a split point and i = 1, 2, ..., H. Alternatively, for a vertical subtitle a vertical projection is performed: the stroke pixel values are accumulated in the vertical direction and stored in Hist(i), with i defined in the same way.
The projection can be realized by various known methods, for example: the method disclosed in Lyu, M.R., Jiqiang Song, Min Cai, "A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction", IEEE Trans. on Circuits and Systems for Video Technology, 15(2):243-255, 2005, of which FIG. 7 is a projection-based method; and the method disclosed in Xiaoou Tang, Xinbo Gao, Jianzhuang Liu, Hongjiang Zhang, "A Spatial-Temporal Approach for Video Caption Detection and Recognition", IEEE Transactions on Neural Networks, 13(4):961-971, 2002, of which FIG. 9 is a projection-based method.

流程然后进入步骤S425,在此确定字幕是否能够被分成多行。如果Hist(i)<=T,表示字幕在位置i处可分,i是正整数。记录所有的可能的可分位置的位置坐标值i1,i2,...,ij,并且在步骤S430中计算可分成的各行字幕图像的高度(i2-i1),(i3-i2),...,(ij-i(j-1))的平均值作为字幕的字符高度。其中,j是小于或者等于NC的正整数,NC是分割点的个数,T是预定常数,i,j的值取决于字幕是否可分的情况,T的值可根据实际情况预先适当地设置。另一方面,当步骤S425的确定结果是Hist(i)>T时,表示该字幕不可分,即,存在字符粘连的情况,流程于是进入步骤S435。The flow then proceeds to step S425, where it is determined whether the subtitle can be divided into multiple lines. If Hist(i)<=T, the subtitle is separable at position i, where i is a positive integer. The position coordinate values i1, i2, ..., ij of all possible separable positions are recorded, and in step S430 the average of the heights (i2-i1), (i3-i2), ..., (ij-i(j-1)) of the resulting subtitle lines is calculated as the character height of the subtitle. Here, j is a positive integer less than or equal to NC, NC is the number of segmentation points, T is a predetermined constant, the values of i and j depend on whether the subtitle is separable, and the value of T can be appropriately set in advance according to the actual situation. On the other hand, when the determination result of step S425 is Hist(i)>T, the subtitle is inseparable, that is, characters are touching, and the flow then proceeds to step S435.
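Steps S410 through S430 can be sketched as follows for a horizontal subtitle; MIN_CS and T are assumed parameter values, and collapsing runs of consecutive blank rows to a single split position is one interpretation of recording the positions i1, i2, ..., ij:

```python
def character_height_by_projection(binary, min_cs=8, t=0):
    """Sketch of steps S410-S430 for a horizontal subtitle.

    MIN_CS (min_cs) and T (t) are assumed parameter values.  Hist(i)
    counts stroke pixels per row; rows with Hist(i) <= t are treated
    as separable positions between subtitle lines.  Returns the
    estimated character height, or None when the subtitle is
    inseparable and the flow would fall through to the CCA of S435.
    """
    H = len(binary)
    if H <= 2 * min_cs:              # step S410: single-line subtitle
        return H                     # step S415: height is the image height
    # Step S420: horizontal projection of stroke pixels, row by row.
    hist = [sum(1 for px in row if px == 255) for row in binary]
    # Step S425: collect separable positions; runs of consecutive blank
    # rows are collapsed to one position i1, i2, ..., ij (an assumption).
    positions, prev = [], None
    for i, value in enumerate(hist):
        if value <= t:
            if prev is None or i > prev + 1:
                positions.append(i)
            prev = i
    if len(positions) < 2:
        return None                  # inseparable: go to step S435 (CCA)
    # Step S430: average of (i2-i1), (i3-i2), ... as the character height.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)
```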

在步骤S435中,对于字幕的图像特征进行连通域(Connected component analysis,CCA)分析。CCA分析是提取粘连的字符时常用的方法,在此省略其具体细节的描述。经CCA分析后得到若干个字幕字符组件(CC),估计字幕字符组件(CC)高度的平均值为H1。In step S435, connected component analysis (CCA) is performed on the image features of the subtitle. CCA is a method commonly used to extract touching characters, and a detailed description of it is omitted here. Several subtitle character components (CC) are obtained after the CCA, and the average height of the subtitle character components (CC) is estimated as H1.
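One common form of the CCA of step S435 can be sketched with a breadth-first flood fill; 4-connectivity and the bounding-box representation below are assumptions, since the text leaves the details to known methods:

```python
from collections import deque

def connected_components(binary):
    """Sketch of the CCA of step S435: label 4-connected stroke regions
    of a binarized subtitle image (stroke pixels == 255) and return the
    bounding box (top, left, bottom, right) of each character component.
    """
    H, W = len(binary), len(binary[0])
    seen = [[False] * W for _ in range(H)]
    boxes = []
    for y in range(H):
        for x in range(W):
            if binary[y][x] != 255 or seen[y][x]:
                continue
            # Breadth-first flood fill of one component.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < H and 0 <= nx < W and binary[ny][nx] == 255 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def average_component_height(boxes):
    """H1: the mean bounding-box height of the character components."""
    return sum(bottom - top + 1 for top, _, bottom, _ in boxes) / len(boxes)
```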

在步骤S440,对字幕的CC进行对齐操作,对齐后得到新的字幕字符组件CC_New,估计新的CC_New的高度的平均值为H2。这种对齐操作的具体实施方式如下:首先求取每个CC的矩形中心,例如用矩形中心的坐标值yn表示。对于任意两个CC的矩形中心y1,y2(y1,y2为矩形中心坐标值),如果|y1-y2|<M(M=3),那么认为这两个矩形大致位于一条直线上,即是对齐的。其中,M是预先设定的阈值,根据实际情况也可以取不同于3的其他值。这种对齐操作是为了去除字符组件CC中可能存在的噪声,以避免对字幕字符高度的计算造成负面影响。In step S440, the CCs of the subtitle are aligned, and new subtitle character components CC_New are obtained after the alignment; the average height of the new CC_New is estimated as H2. A specific implementation of this alignment operation is as follows: first, the rectangle center of each CC is obtained, represented for example by its coordinate value yn. For the rectangle centers y1, y2 of any two CCs (y1, y2 being the coordinate values of the rectangle centers), if |y1-y2|<M (M=3), the two rectangles are considered to lie roughly on one straight line, that is, to be aligned. Here, M is a preset threshold, and values other than 3 may also be taken according to the actual situation. This alignment operation removes noise that may exist among the character components CC, so as to avoid a negative impact on the calculation of the subtitle character height.
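The alignment operation of step S440 can be sketched as follows; keeping only the largest aligned group is one interpretation of the noise-removal intent, stated here as an assumption:

```python
def align_components(boxes, m=3):
    """Sketch of the alignment of step S440.

    Each box is (top, left, bottom, right); its vertical center yn is
    compared pairwise, and components whose centers satisfy
    |y1 - y2| < M (M = 3, as in the text) are treated as lying on one
    line.  Keeping only the largest such group is one interpretation
    of the noise removal; the result plays the role of CC_New, and H2
    is then the mean height of the surviving boxes.
    """
    centers = [(top + bottom) / 2.0 for top, _, bottom, _ in boxes]
    best = []
    for y_ref in centers:
        group = [box for box, y in zip(boxes, centers) if abs(y - y_ref) < m]
        if len(group) > len(best):
            best = group
    return best
```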

在步骤S445中,确定经过对齐操作后得到的新的字符组件CC_new中的某个字符组件的面积是否大于(α*H*W)或者新的字符组件CC_new的数目是否等于1。如果上述条件中的任何一个条件成立,则流程进入步骤S455,将字幕的字符高度确定为H1;否则,在步骤S450,将字幕的字符高度确定为H2。其中,α是预定的常数,可根据实际情况具体地设置。由于在步骤S440的对齐操作之后,可能因去除噪声的处理不理想而导致字符以更高的程度混在一起,甚至混成一团,其表现为所获得的CC_new的数目变为一个(所有字符组件粘连在一起),或者某个CC_new的面积过大(某些字符组件粘连在一起)。步骤S445中的处理就是用于判断这种粘连情况的发生。如果发生粘连,则用步骤S435计算的平均值H1作为字幕字符的高度;如果没有发生粘连,则可用步骤S440中计算的经过噪声去除处理后的平均值H2作为字幕字符的高度。In step S445, it is determined whether the area of any character component among the new character components CC_new obtained after the alignment operation is greater than (α*H*W), or whether the number of the new character components CC_new is equal to 1. If either of these conditions holds, the flow proceeds to step S455 and the character height of the subtitle is determined as H1; otherwise, in step S450, the character height of the subtitle is determined as H2. Here, α is a predetermined constant that can be set according to the actual situation. After the alignment operation of step S440, imperfect noise removal may cause the characters to be merged to a greater degree, or even merged into one mass, which manifests as the number of CC_new obtained becoming one (all character components merged together), or the area of some CC_new being too large (some character components merged together). The processing in step S445 judges whether such merging has occurred. If merging occurs, the average value H1 calculated in step S435 is used as the height of the subtitle characters; if no merging occurs, the noise-removed average value H2 calculated in step S440 can be used as the height of the subtitle characters.
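The decision of steps S445 through S455 can be sketched as follows; the value of α is an assumption, since the text only states that it is a predetermined constant:

```python
def choose_character_height(cc_new, h1, h2, H, W, alpha=0.5):
    """Sketch of steps S445-S455 (alpha is an assumed constant).

    If any aligned component covers more than alpha*H*W of the image,
    or only one component is left (characters merged into one mass),
    fall back to H1 from the plain CCA of S435; otherwise use H2 from
    the aligned components of S440.
    """
    merged = len(cc_new) == 1 or any(
        (bottom - top + 1) * (right - left + 1) > alpha * H * W
        for top, left, bottom, right in cc_new
    )
    return h1 if merged else h2
```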

在上述的方法中,用于计算可能的可分位置的位置坐标值i1,i2,...,ij以及任意两个字符组件(CC)的矩形中心坐标值y1,y2的坐标系可以取任何合适的平面坐标系,例如,以字幕图像所在平面作为坐标平面的笛卡尔直角坐标系,等等。In the above method, the coordinate system used for the position coordinate values i1, i2, ..., ij of the possible separable positions and for the rectangle center coordinate values y1, y2 of any two character components (CC) may be any suitable plane coordinate system, for example, a Cartesian coordinate system whose coordinate plane is the plane of the subtitle image, and so on.

另外,作为如图4中所示的字幕字符尺寸确定方法的一种优选方式,如果使用在发明人为白洪亮等、发明名称为“字幕存在时间确定装置和方法”、申请号为200810074125.3的中国专利申请中公开的方案中得到的稳定全局特征作为字幕图像特征来计算字幕的字符高度,则可以对字幕进行鲁棒性稳定的跟踪,不断地优化跟踪的区域。In addition, as a preferred mode of the subtitle character size determination method shown in FIG. 4, if the stable global features obtained by the scheme disclosed in Chinese patent application No. 200810074125.3, entitled "Apparatus and Method for Determining Subtitle Existence Time" by Bai Hongliang et al., are used as the subtitle image features to calculate the character height of the subtitle, the subtitle can be tracked robustly and stably, and the tracked region can be continuously optimized.

作为可替选方案,在根据本发明上述实施例的视频摘要生成方法中,还可以包括视频摘要输出步骤,用于将所述视频摘要生成步骤中生成的视频摘要输出。例如,可将所生成的视频摘要输出到预定的存储设备中。As an alternative, the method for generating a video summary according to the above-mentioned embodiments of the present invention may further include a video summary outputting step for outputting the video summary generated in the video summary generating step. For example, the generated video summary can be output to a predetermined storage device.

根据本发明的其他实施例,还提供了一种用于生成视频信号的视频摘要的装置。图5是示出了根据本发明的该实施例用于实现生成视频摘要的方法的装置500的示意框图。如图所示,装置500包括:视频解码单元510,用于对视频信号进行解码,以便获得多帧视频解码图像;字幕特性获得单元520,用于获得多帧视频解码图像中包含的所有字幕中每一个字幕的如下特性中的至少一种:字幕的持续时间,字幕在相应的视频解码图像中的位置,字幕的字符尺寸,字幕图像经光学字符识别(OCR)后的识别码和可信度;以及,视频摘要生成单元530,用于根据视频信号的内容与字幕的特性之间的关系,对所获得的字幕的至少一种特性进行处理,以便生成与视频信号的内容相适应的视频摘要。According to other embodiments of the present invention, an apparatus for generating a video summary of a video signal is also provided. FIG. 5 is a schematic block diagram of an apparatus 500 for implementing the video summary generation method according to this embodiment of the present invention. As shown in the figure, the apparatus 500 includes: a video decoding unit 510 for decoding the video signal so as to obtain multiple frames of decoded video images; a subtitle characteristic obtaining unit 520 for obtaining at least one of the following characteristics of each of the subtitles contained in the multiple frames of decoded video images: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition code and confidence of the subtitle image after optical character recognition (OCR); and a video summary generation unit 530 for processing at least one obtained characteristic of the subtitles according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary adapted to the content of the video signal.
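The three units of apparatus 500 can be pictured as one pipeline; the following sketch uses illustrative names and pluggable callables that are assumptions, not part of the patent:

```python
class VideoSummaryDevice:
    """Sketch of apparatus 500 as a pipeline of its three units; the
    class and parameter names are illustrative, not from the patent,
    and each unit is a pluggable callable."""

    def __init__(self, decode, get_properties, select_frames):
        self.decode = decode                  # video decoding unit 510
        self.get_properties = get_properties  # subtitle characteristic unit 520
        self.select_frames = select_frames    # summary generation unit 530

    def summarize(self, video_signal):
        frames = self.decode(video_signal)
        properties = [self.get_properties(frame) for frame in frames]
        return self.select_frames(frames, properties)
```

A usage example with trivial stand-ins for the three units: decoding yields frames, each frame gets a subtitle-property record, and frames are selected by a duration threshold.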

本领域技术人员了解,如图5中示出的装置500所包括的视频解码单元510,字幕特性获得单元520和视频摘要生成单元530可以被配置成执行上面结合图1,2A-2B,3A-3B描述的视频摘要生成方法,以及虽然没有在各附图中示出但是已经在上面的各种具体实例中充分描述的视频摘要提取方法。Those skilled in the art will understand that the video decoding unit 510, the subtitle characteristic obtaining unit 520 and the video summary generation unit 530 included in the apparatus 500 shown in FIG. 5 can be configured to perform the video summary generation methods described above in conjunction with FIGS. 1, 2A-2B and 3A-3B, as well as the video summary extraction methods which, although not shown in the drawings, have been fully described in the various specific examples above.

上述装置中各个组成单元可通过软件、硬件或其组合的方式进行配置。配置可使用的具体手段或方式为本领域技术人员所熟知,在此不再赘述。Each constituent unit in the above device can be configured by means of software, hardware or a combination thereof. Specific means or manners that can be used for configuration are well known to those skilled in the art, and will not be repeated here.

本发明的其他实施例还提出了一种视频图像处理系统,其配备有根据上述图5示出的根据本发明的实施例的装置,因此可用于实现上述的根据本发明的实施例的视频摘要生成方法。Other embodiments of the present invention further provide a video image processing system which is equipped with the apparatus according to the embodiment of the present invention shown in FIG. 5 above, and which can therefore be used to implement the video summary generation method according to the above embodiments of the present invention.

这种视频图像处理系统例如可以是电视采集卡、DVD播放器或者膝上型计算机,等等。Such a video image processing system may be, for example, a TV capture card, a DVD player, or a laptop computer, and the like.

此外,根据本发明上述实施例的视频摘要生成方法可以通过存储有机器可读取的指令代码的程序产品来实现。这些指令代码由机器例如计算机读取并执行时,可执行根据本发明上述实施例的视频摘要生成方法的各个操作过程和步骤。该程序产品可以具有任意的表现形式,例如,目标程序、解释器执行的程序或者提供给操作系统的脚本程序等。In addition, the video summary generation method according to the above embodiments of the present invention may be implemented by a program product storing machine-readable instruction codes. When these instruction codes are read and executed by a machine such as a computer, the operations and steps of the video summary generation method according to the above embodiments of the present invention can be performed. The program product may take any form, for example, an object program, a program executed by an interpreter, or a script program provided to an operating system.

相应地,用于承载上述存储有机器可读取的指令代码的程序产品的存储介质也包括在本发明的公开中。所述存储介质包括但不限于软盘、光盘、磁光盘、存储卡、存储棒,等等。Correspondingly, a storage medium for carrying the program product storing the above-mentioned machine-readable instruction codes is also included in the disclosure of the present invention. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.

在上面对本发明具体实施例的描述中,针对一种实施方式描述和/或示出的特征可以以相同或类似的方式在一个或更多个其它实施方式中使用,与其它实施方式中的特征相组合,或替代其它实施方式中的特征。In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, combined with features in other embodiments, or substituted for features in other embodiments.

应该强调,术语“包括/包含”在本文使用时指特征、要素、步骤或组件的存在,但并不排除一个或更多个其它特征、要素、步骤或组件的存在或附加。It should be emphasized that the term "comprising/comprising" when used herein refers to the presence of a feature, element, step or component, but does not exclude the presence or addition of one or more other features, elements, steps or components.

此外,本发明的方法不限于按照说明书中描述的时间顺序来执行,也可以按照其他的时间顺序地、并行地或独立地执行。因此,本说明书中描述的方法的执行顺序不对本发明的技术范围构成限制。In addition, the method of the present invention is not limited to being executed in the chronological order described in the specification, and may also be executed in other chronological order, in parallel or independently. Therefore, the execution order of the methods described in this specification does not limit the technical scope of the present invention.

尽管上面已经通过对本发明的具体实施例的描述对本发明进行了披露,但是,应该理解,本领域的技术人员可在所附权利要求的精神和范围内设计对本发明的各种修改、改进或者等同物。这些修改、改进或者等同物也应当被认为包括在本发明的保护范围内。Although the present invention has been disclosed above through the description of specific embodiments thereof, it should be understood that those skilled in the art can devise various modifications, improvements or equivalents of the present invention within the spirit and scope of the appended claims. These modifications, improvements or equivalents should also be considered as falling within the protection scope of the present invention.

Claims (38)

1.一种用于生成视频信号的视频摘要的方法,包括步骤:1. A method for generating a video summary of a video signal, comprising the steps of: 视频解码步骤,用于对视频信号进行解码,以便获得多帧视频解码图像;a video decoding step for decoding the video signal so as to obtain multiple frames of decoded video images; 字幕特性获得步骤,用于获得所述多帧视频解码图像中包含的所有字幕中每一个字幕的如下特性中的至少一种:字幕的持续时间,字幕在相应的视频解码图像中的位置,字幕的字符尺寸,字幕图像经光学字符识别OCR后的识别码和可信度;和a subtitle characteristic obtaining step for obtaining at least one of the following characteristics of each of the subtitles contained in the multiple frames of decoded video images: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition code and confidence of the subtitle image after optical character recognition (OCR); and 视频摘要生成步骤,用于根据所述视频信号的内容与字幕的特性之间的关系,对所获得的字幕的至少一种特性进行处理,以便生成与所述视频信号的内容相适应的视频摘要,a video summary generation step for processing at least one obtained characteristic of the subtitles according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary adapted to the content of the video signal, 其中,所述视频信号的内容与字幕的特性之间的关系表明:所述视频解码图像中包括N级的级联式字幕,其中第一级字幕至第N级字幕在体现视频信号的主要内容的方面的重要性按降序或者升序排列,其中N是正整数;以及wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that the decoded video images include N levels of cascaded subtitles, the first-level to Nth-level subtitles being ranked in descending or ascending order of importance in reflecting the main content of the video signal, where N is a positive integer; and 所述视频摘要生成步骤生成所述视频信号的级联式视频摘要,所述级联式视频摘要包括:包含有所述第一级字幕至所述第N级字幕的级联式字幕的至少一帧视频解码图像。the video summary generation step generates a cascaded video summary of the video signal, the cascaded video summary including at least one frame of decoded video image containing the cascaded subtitles of the first-level to Nth-level subtitles.
2.如权利要求1所述的方法,其中,所述视频信号是新闻类型或者谈话类型的视频信号,N=3,第一级字幕至第三级字幕在体现所述视频信号的主要内容的方面的重要性按降序排列;以及2. The method as claimed in claim 1, wherein the video signal is a news-type or talk-type video signal, N=3, and the first-level to third-level subtitles are ranked in descending order of importance in reflecting the main content of the video signal; and 所述视频摘要生成步骤通过对所获得的字幕的持续时间、字幕在相应的视频解码图像中的位置、字幕图像经光学字符识别后的识别码和可信度进行处理来生成所述视频信号的所述级联式视频摘要,其中,所述的生成级联式视频摘要的处理包括:the video summary generation step generates the cascaded video summary of the video signal by processing the obtained durations of the subtitles, the positions of the subtitles in the corresponding decoded video images, and the recognition codes and confidences of the subtitle images after optical character recognition, wherein the processing of generating the cascaded video summary includes: 将位于视频解码图像的左上部或者右上部、且持续时间为最长的第一持续时间的字幕确定为第一级字幕,所述第一级字幕表示所述新闻类型或者谈话类型的视频信号所涉及的主题;determining the subtitles located in the upper-left or upper-right part of the decoded video images and having the longest, first duration as first-level subtitles, the first-level subtitles representing the subject matter covered by the news-type or talk-type video signal;
在包含有所述第一级字幕的所有视频解码图像中,将位于所述视频解码图像的左侧中部或者右侧中部、且持续时间为比所述第一持续时间要短的第二持续时间的字幕作为第二级字幕,以及,根据所述第二级字幕的图像经光学字符识别后的识别码和可信度所表达的姓氏、头衔或称呼语,将所述第二级字幕划分为不同的子字幕,所述不同的子字幕分别表示在所述第一级字幕的持续时间期间、所述新闻或者谈话的不同参与者的识别符;in all the decoded video images containing the first-level subtitles, taking the subtitles located in the middle-left or middle-right part of the decoded video images and having a second duration shorter than the first duration as second-level subtitles, and dividing the second-level subtitles into different sub-subtitles according to the surnames, titles or salutations expressed by the recognition codes and confidences of the images of the second-level subtitles after optical character recognition, the different sub-subtitles respectively representing identifiers of different participants of the news or conversation during the duration of the first-level subtitles; 在包含有所述第一级字幕和第二级字幕的所有视频解码图像中,将位于所述视频解码图像的下部、且持续时间为比所述第二持续时间要短的第三持续时间的字幕确定为第三级字幕,其中将所述第三级字幕中的、与所述第二级字幕中不同的子字幕共同存在的部分确定为所述不同子字幕所表示的不同的参与者的谈话内容;以及in all the decoded video images containing the first-level and second-level subtitles, determining the subtitles located in the lower part of the decoded video images and having a third duration shorter than the second duration as third-level subtitles, wherein the parts of the third-level subtitles coexisting with the different sub-subtitles of the second-level subtitles are determined as the speech content of the different participants represented by those sub-subtitles; and 选择包含有第一级字幕、第二级字幕和第三级字幕的级联式字幕的至少一帧视频解码图像,作为所述视频信号的级联式视频摘要。selecting at least one frame of decoded video image containing the cascaded first-level, second-level and third-level subtitles as the cascaded video summary of the video signal. 3.如权利要求1或2所述的方法,还包括视频摘要输出步骤,用于将所述视频摘要生成步骤中生成的视频摘要输出。3.
The method as claimed in claim 1 or 2, further comprising a video summary output step for outputting the video summary generated in the video summary generation step. 4.一种用于生成视频信号的视频摘要的方法,包括步骤:4. A method for generating a video summary of a video signal, comprising the steps of: 视频解码步骤,用于对视频信号进行解码,以便获得多帧视频解码图像;a video decoding step for decoding the video signal so as to obtain multiple frames of decoded video images; 字幕特性获得步骤,用于获得所述多帧视频解码图像中包含的所有字幕中每一个字幕的如下特性中的至少一种:字幕的持续时间,字幕在相应的视频解码图像中的位置,字幕的字符尺寸,字幕图像经光学字符识别OCR后的识别码和可信度;和a subtitle characteristic obtaining step for obtaining at least one of the following characteristics of each of the subtitles contained in the multiple frames of decoded video images: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition code and confidence of the subtitle image after optical character recognition (OCR); and 视频摘要生成步骤,用于根据所述视频信号的内容与字幕的特性之间的关系,对所获得的字幕的至少一种特性进行处理,以便生成与所述视频信号的内容相适应的视频摘要,a video summary generation step for processing at least one obtained characteristic of the subtitles according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary adapted to the content of the video signal, 其中,所述视频信号的内容与字幕的特性之间的关系表明:所述字幕位于所述视频解码图像中的固定位置并且涉及所述视频信号的主要内容,以及wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that the subtitles are located at a fixed position in the decoded video images and relate to the main content of the video signal, and 所述视频摘要生成步骤包括:将位于所述固定位置处的字幕进行比较,确定字幕的图像发生变化的时间点处的字幕,并且由包含所确定的字幕的视频解码图像来组成所述视频信号的视频摘要。the video summary generation step includes: comparing the subtitles located at the fixed position, determining the subtitles at the time points at which the subtitle image changes, and composing the video summary of the video signal from the decoded video images containing the determined subtitles.
5.如权利要求4所述的方法,其中,所述视频信号是比赛类型的视频信号,所述位于固定位置处的字幕是所述视频解码图像中的、显示比赛的比分信息的比分牌,以及5. The method as claimed in claim 4, wherein the video signal is a game-type video signal, and the subtitle located at the fixed position is a scoreboard in the decoded video images that displays score information of the game, and 所述视频摘要生成的步骤包括:将所有比分牌字幕进行比较,确定比分牌字幕的图像发生变化的时间点处的比分牌字幕,并且由包含所确定的比分牌字幕的视频解码图像来组成所述视频信号的视频摘要。the video summary generation step includes: comparing all the scoreboard subtitles, determining the scoreboard subtitles at the time points at which the scoreboard subtitle image changes, and composing the video summary of the video signal from the decoded video images containing the determined scoreboard subtitles. 6.如权利要求4或5所述的方法,还包括视频摘要输出步骤,用于将所述视频摘要生成步骤中生成的视频摘要输出。6. The method according to claim 4 or 5, further comprising a video summary output step for outputting the video summary generated in the video summary generation step. 7.一种用于生成视频信号的视频摘要的方法,包括步骤:7.
A method for generating a video summary of a video signal, comprising the steps of: 视频解码步骤,用于对视频信号进行解码,以便获得多帧视频解码图像;The video decoding step is used to decode the video signal so as to obtain a multi-frame video decoding image; 字幕特性获得步骤,用于获得所述多帧视频解码图像中包含的所有字幕中每一个字幕的如下特性中的至少一种:字幕的持续时间,字幕在相应的视频解码图像中的位置,字幕的字符尺寸,字幕图像经光学字符识别OCR后的识别码和可信度;和The subtitle characteristic obtaining step is used to obtain at least one of the following characteristics of each subtitle in all the subtitles contained in the multi-frame video decoded image: the duration of the subtitle, the position of the subtitle in the corresponding video decoded image, the subtitle The character size of the subtitle image, the identification code and reliability after OCR of the subtitle image; and 视频摘要生成步骤,用于根据所述视频信号的内容与字幕的特性之间的关系,对所获得的字幕的至少一种特性进行处理,以便生成与所述视频信号的内容相适应的视频摘要,A video summary generation step, for processing at least one characteristic of the obtained subtitles according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary suitable for the content of the video signal , 其中,所述视频信号的内容与字幕的特性之间的关系表明:字符尺寸较大并且字幕持续时间较长的字幕涉及所述视频信号的较为重要的内容,以及Wherein, the relationship between the content of the video signal and the characteristics of the subtitles indicates that subtitles with a larger character size and a longer subtitle duration relate to more important content of the video signal, and 所述视频摘要生成步骤包括:The video abstract generation steps include: 设定字幕的字符尺寸的尺寸阈值H以及字幕的持续时间的时间阈值T;以及Setting the size threshold H of the character size of the subtitle and the time threshold T of the duration of the subtitle; and 确定字符尺寸大于H并且持续时间大于T的字幕,并且由包含所确定的字幕的视频解码图像来组成所述视频信号的视频摘要。A subtitle with a character size greater than H and a duration greater than T is determined, and a video digest of the video signal is composed of video decoded images containing the determined subtitle. 8.如权利要求7所述的方法,其中,字幕特性获得步骤通过如下处理来获得字幕的字符高度,以作为字幕的字符尺寸:8. 
The method as claimed in claim 7, wherein the subtitle characteristic obtaining step obtains the character height of the subtitle by the following processing, as the character size of the subtitle: 对字幕的图像进行处理以得到所述字幕的图像的二值化图像;processing the image of the subtitle to obtain a binarized image of the image of the subtitle; 如果字幕的图像的高度H小于或等于预先定义的字符高度MIN_CS的2倍,则将图像高度H确定为字幕的字符高度,否则进入下一步;If the height H of the image of the subtitle is less than or equal to 2 times of the predefined character height MIN_CS, then the image height H is determined as the character height of the subtitle, otherwise enter the next step; 对于水平字幕的图像进行水平投影,在水平方向对字幕图像的笔划像素值累加,累加值存储在Hist(i),i=1,2,...,H,或者,对于垂直字幕的图像进行垂直投影,在垂直方向对字幕图像的笔划像素值累加,累加值存储在Hist(i),i=1,2,....H,其中,i是作为正整数的索引值,表示分割点;For the image of the horizontal subtitle, the horizontal projection is performed, and the stroke pixel value of the subtitle image is accumulated in the horizontal direction, and the accumulated value is stored in Hist(i), i=1,2,...,H, or, for the image of the vertical subtitle, the Vertical projection, accumulating the stroke pixel values of the subtitle image in the vertical direction, and the accumulated value is stored in Hist(i), i=1,2,....H, where i is an index value as a positive integer, indicating the segmentation point ; 如果Hist(i)<=T,表示该字幕在分割点i的位置处可分,记录所有的可能的可分位置的位置坐标值i1,i2,...ij,并将(i2-i1),(i3-i2),...,(ij-i(j-1))的平均值作为字幕的字符高度,以及,当Hist(i)>T时,表示该字幕不可分,进入下一步,其中,j是小于或者等于H的正整数,H是分割点的个数,T是预定常数;If Hist(i)<=T, it means that the subtitle is separable at the position of the split point i, record the position coordinate values i1, i2,...ij of all possible separable positions, and set (i2-i1) ,(i3-i2),...,(ij-i(j-1)) is the average value of the character height of the subtitle, and when Hist(i)>T, it means that the subtitle is inseparable and enters the next step, Wherein, j is a positive integer less than or equal to H, H is the number of segmentation points, and T is a predetermined constant; 
对于字幕的图像特征进行连通域CCA分析,并且估计字幕的字符组件CC高度的平均值为H1;Connected domain CCA analysis is performed on the image features of subtitles, and the average value of CC height of character components of subtitles is estimated to be H1; 对字幕的字符组件CC进行对齐操作,对齐后得到新的字符组件CC_New,估计新的字符组件CC_New的高度的平均值为H2;Perform an alignment operation on the character component CC of the subtitle, and obtain a new character component CC_New after alignment, and estimate the average height of the new character component CC_New to be H2; 如果新的字符组件CC_new中的某个字符组件的面积大于(α*H*W)或者新的字符组件CC_new的数目等于1,则将字幕的字符高度确定为H1,否则将字幕的字符高度确定为H2,其中,α是预定的常数,W是字幕的图像的宽度。If the area of a certain character component in the new character component CC_new is greater than (α*H*W) or the number of the new character component CC_new is equal to 1, then the character height of the subtitle is determined as H1, otherwise the character height of the subtitle is determined is H2, where α is a predetermined constant, and W is the width of the subtitle image. 9.如权利要求7或8所述的方法,还包括视频摘要输出步骤,用于将所述视频摘要生成步骤中生成的视频摘要输出。9. The method according to claim 7 or 8, further comprising a video abstract output step for outputting the video abstract generated in the video abstract generation step. 10.一种用于生成视频信号的视频摘要的方法,包括步骤:10. 
A method for generating a video summary of a video signal, comprising the steps of: 视频解码步骤,用于对视频信号进行解码,以便获得多帧视频解码图像;The video decoding step is used to decode the video signal so as to obtain a multi-frame video decoding image; 字幕特性获得步骤,用于获得所述多帧视频解码图像中包含的所有字幕中每一个字幕的如下特性中的至少一种:字幕的持续时间,字幕在相应的视频解码图像中的位置,字幕的字符尺寸,字幕图像经光学字符识别OCR后的识别码和可信度;和The subtitle characteristic obtaining step is used to obtain at least one of the following characteristics of each subtitle in all the subtitles contained in the multi-frame video decoded image: the duration of the subtitle, the position of the subtitle in the corresponding video decoded image, the subtitle The character size of the subtitle image, the identification code and reliability after OCR of the subtitle image; and 视频摘要生成步骤,用于根据所述视频信号的内容与字幕的特性之间的关系,对所获得的字幕的至少一种特性进行处理,以便生成与所述视频信号的内容相适应的视频摘要,A video summary generation step, for processing at least one characteristic of the obtained subtitles according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary suitable for the content of the video signal , 其中,所述视频信号的内容与字幕的特性之间的关系表明:字幕持续时间较长并且位于视频解码图像中特定位置处的字幕涉及所述视频信号的较为重要的内容,以及Wherein, the relationship between the content of the video signal and the characteristics of the subtitle indicates that the subtitle has a longer duration and the subtitle located at a specific position in the video decoded image involves more important content of the video signal, and 所述视频摘要生成步骤包括:The video abstract generation steps include: 设定字幕的持续时间的时间阈值T;以及setting a time threshold T for the duration of subtitles; and 确定字幕的持续时间大于T并且位于视频解码图像的特定位置处的字幕,并且由包含所确定的字幕的视频解码图像来组成所述视频信号的视频摘要。A subtitle whose duration is greater than T and located at a specific position of the video decoded picture is determined, and a video summary of the video signal is composed from the video decoded picture containing the determined subtitle. 11.如权利要求10所述的方法,还包括视频摘要输出步骤,用于将所述视频摘要生成步骤中生成的视频摘要输出。11. 
The method according to claim 10, further comprising a video summary output step for outputting the video summary generated in the video summary generation step. 12.一种用于生成视频信号的视频摘要的方法,包括步骤:12. A method for generating a video summary of a video signal, comprising the steps of: 视频解码步骤,用于对视频信号进行解码,以便获得多帧视频解码图像;The video decoding step is used to decode the video signal so as to obtain a multi-frame video decoding image; 字幕特性获得步骤,用于获得所述多帧视频解码图像中包含的所有字幕中每一个字幕的如下特性中的至少一种:字幕的持续时间,字幕在相应的视频解码图像中的位置,字幕的字符尺寸,字幕图像经光学字符识别OCR后的识别码和可信度;和The subtitle characteristic obtaining step is used to obtain at least one of the following characteristics of each subtitle in all the subtitles contained in the multi-frame video decoded image: the duration of the subtitle, the position of the subtitle in the corresponding video decoded image, the subtitle The character size of the subtitle image, the identification code and reliability after OCR of the subtitle image; and 视频摘要生成步骤,用于根据所述视频信号的内容与字幕的特性之间的关系,对所获得的字幕的至少一种特性进行处理,以便生成与所述视频信号的内容相适应的视频摘要,A video summary generation step, for processing at least one characteristic of the obtained subtitles according to the relationship between the content of the video signal and the characteristics of the subtitles, so as to generate a video summary suitable for the content of the video signal , 其中,所述视频信号的内容与字幕的特性之间的关系表明:字符尺寸较大并且位于视频解码图像中特定位置处的字幕涉及所述视频信号的较为重要的内容,以及Wherein, the relationship between the content of the video signal and the characteristics of the subtitles indicates that the subtitles with larger character size and located at a specific position in the video decoded image relate to more important content of the video signal, and 所述视频摘要生成步骤包括:The video abstract generation steps include: 设定字幕的字符尺寸的尺寸阈值H;以及setting the size threshold H of the character size of the subtitle; and 确定字符尺寸大于H并且位于视频解码图像的所述特定位置处的字幕,并且由包含所确定的字幕的视频解码图像来组成所述视频信号的视频摘要。A subtitle whose character size is greater than H and located at the specific position of the video decoded 
image is determined, and a video digest of the video signal is composed from the video decoded image including the determined subtitle. 13.如权利要求12所述的方法,其中,字幕特性获得步骤通过如下处理来获得字幕的字符高度,以作为字幕的字符尺寸:13. The method as claimed in claim 12, wherein the subtitle characteristic obtaining step obtains the character height of the subtitle by the following processing, as the character size of the subtitle: 对字幕的图像进行处理以得到所述字幕的图像的二值化图像;processing the image of the subtitle to obtain a binarized image of the image of the subtitle; 如果字幕的图像的高度H小于或等于预先定义的字符高度MIN_CS的2倍,则将图像高度H确定为字幕的字符高度,否则进入下一步;If the height H of the image of the subtitle is less than or equal to 2 times of the predefined character height MIN_CS, then the image height H is determined as the character height of the subtitle, otherwise enter the next step; 对于水平字幕的图像进行水平投影,在水平方向对字幕图像的笔划像素值累加,累加值存储在Hist(i),i=1,2,...,H,或者,对于垂直字幕的图像进行垂直投影,在垂直方向对字幕图像的笔划像素值累加,累加值存储在Hist(i),i=1,2,....H,其中,i是作为正整数的索引值,表示分割点;For the image of the horizontal subtitle, the horizontal projection is performed, and the stroke pixel value of the subtitle image is accumulated in the horizontal direction, and the accumulated value is stored in Hist(i), i=1,2,...,H, or, for the image of the vertical subtitle, the Vertical projection, accumulating the stroke pixel values of the subtitle image in the vertical direction, and the accumulated value is stored in Hist(i), i=1,2,....H, where i is an index value as a positive integer, indicating the segmentation point ; 如果Hist(i)<=T,表示该字幕在分割点i的位置处可分,记录所有的可能的可分位置的位置坐标值i1,i2,...ij,并将(i2-i1),(i3-i2),...,(ij-i(j-1))的平均值作为字幕的字符高度,以及,当Hist(i)>T时,表示该字幕不可分,进入下一步,其中,j是小于或者等于H的正整数,H是分割点的个数,T是预定常数;If Hist(i)<=T, it means that the subtitle is separable at the position of the split point i, record the position coordinate values i1, i2,...ij of all possible separable positions, and set (i2-i1) ,(i3-i2),...,(ij-i(j-1)) is the average value of the character height of the subtitle, and when Hist(i)>T, it means that the subtitle is 
inseparable and the processing proceeds to the next step, where j is a positive integer less than or equal to H, H is the number of split points, and T is a predetermined constant;

performing connected component analysis (CCA) on the image features of the subtitle, and estimating the average height H1 of the character components (CC) of the subtitle;

performing an alignment operation on the character components CC of the subtitle to obtain new character components CC_New, and estimating the average height H2 of the new character components CC_New;

if the area of any one of the new character components CC_New is greater than (α*H*W), or the number of the new character components CC_New is equal to 1, determining the character height of the subtitle as H1, and otherwise determining it as H2, where α is a predetermined constant and W is the width of the subtitle image.

14. The method according to claim 12 or 13, further comprising a video summary output step of outputting the video summary generated in the video summary generation step.

15.
A method for generating a video summary of a video signal, comprising the steps of:

a video decoding step of decoding the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining step of obtaining, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation step of processing at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that a subtitle located at a specific position in a decoded video image and having a high confidence after optical character recognition relates to more important content of the video signal, and

the video summary generation step comprises:

setting a threshold L for the confidence of subtitle images after optical character recognition; and

determining the subtitles located at the specific position in the decoded video images and having a confidence greater than L after optical character recognition of the
subtitle image, and composing the video summary of the video signal from the decoded video images containing the determined subtitles.

16. The method according to claim 15, further comprising a video summary output step of outputting the video summary generated in the video summary generation step.

17. A method for generating a video summary of a video signal, comprising the steps of:

a video decoding step of decoding the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining step of obtaining, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation step of processing at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that a subtitle having a larger character size and a high confidence after optical character recognition relates to more important content of the video signal, and

the video summary generation step comprises:
setting a size threshold H for the character size of subtitles, and setting a threshold L for the confidence of subtitle images after optical character recognition; and

determining the subtitles whose character size is greater than H and whose confidence after optical character recognition is greater than L, and composing the video summary of the video signal from the decoded video images containing the determined subtitles.

18. The method according to claim 17, wherein the subtitle characteristic obtaining step obtains the character height of the subtitle, as the character size of the subtitle, through the following processing:

processing the image of the subtitle to obtain a binarized image of it;

if the height H of the subtitle image is less than or equal to twice the predefined character height MIN_CS, determining the image height H as the character height of the subtitle, and otherwise proceeding to the next step;

for a horizontal subtitle image, performing a horizontal projection that accumulates the stroke pixel values of the subtitle image in the horizontal direction, the accumulated values being stored in Hist(i), i = 1, 2, ..., H, or, for a vertical subtitle image, performing a vertical projection that accumulates the stroke pixel values of the subtitle image in the vertical direction, the accumulated values being stored in Hist(i), i = 1, 2, ..., H, where i is a positive-integer index denoting a split point;
if Hist(i) <= T, the subtitle is separable at the position of split point i, and the position coordinates i1, i2, ..., ij of all possible separable positions are recorded, the average of (i2-i1), (i3-i2), ..., (ij-i(j-1)) being taken as the character height of the subtitle; when Hist(i) > T, the subtitle is inseparable and the processing proceeds to the next step, where j is a positive integer less than or equal to H, H is the number of split points, and T is a predetermined constant;

performing connected component analysis (CCA) on the image features of the subtitle, and estimating the average height H1 of the character components (CC) of the subtitle;

performing an alignment operation on the character components CC of the subtitle to obtain new character components CC_New, and estimating the average height H2 of the new character components CC_New;

if the area of any one of the new character components CC_New is greater than (α*H*W), or the number of the new character components CC_New is equal to 1, determining the character height of the subtitle as H1, and otherwise determining it as H2, where α is a predetermined constant and W is the width of the subtitle image.

19. The method according to claim 17 or 18, further comprising a video summary output step of outputting the video summary generated in the video summary generation step.

20.
An apparatus for generating a video summary of a video signal, comprising:

a video decoding unit configured to decode the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining unit configured to obtain, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation unit configured to process at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that the decoded video images contain N levels of cascaded subtitles, the first-level to Nth-level subtitles being ranked in descending or ascending order of their importance in reflecting the main content of the video signal, where N is a positive integer; and

the video summary generation unit generates a cascaded video summary of the video signal, the cascaded video summary comprising at least one frame of decoded video image that contains the cascaded subtitles from the first-level subtitles to the Nth-level subtitles.
21. The apparatus according to claim 20, wherein the video signal is a news-type or talk-type video signal, N = 3, and the first-level to third-level subtitles are ranked in descending order of their importance in reflecting the main content of the video signal; and

the video summary generation unit generates the cascaded video summary of the video signal by processing the obtained durations of the subtitles, the positions of the subtitles in the corresponding decoded video images, and the recognition results and confidences of the subtitle images after optical character recognition, wherein the video summary generation unit is configured to:

determine, as first-level subtitles, the subtitles located in the upper-left or upper-right part of the decoded video images and lasting for a first duration that is the longest, the first-level subtitles representing
the subject matter covered;

determine, as second-level subtitles, among all the decoded video images containing the first-level subtitles, the subtitles located in the middle-left or middle-right part of the decoded video images and lasting for a second duration shorter than the first duration, and divide the second-level subtitles, according to the surnames, titles or forms of address expressed by the recognition results and confidences of their images after optical character recognition, into different sub-subtitles that respectively identify the different participants of the news or conversation during the duration of the first-level subtitles;

determine, as third-level subtitles, among all the decoded video images containing the first-level and second-level subtitles, the subtitles located in the lower part of the decoded video images and lasting for a third duration shorter than the second duration, the parts of the third-level subtitles that coexist with different sub-subtitles of the second-level subtitles being determined as the speech content of the different participants represented by those sub-subtitles; and

select, as the cascaded video summary of the video signal, at least one frame of decoded video image containing the cascaded first-level, second-level and third-level subtitles.

22.
The apparatus according to claim 20 or 21, further comprising a video summary output unit configured to output the video summary generated by the video summary generation unit.

23. An apparatus for generating a video summary of a video signal, comprising:

a video decoding unit configured to decode the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining unit configured to obtain, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation unit configured to process at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that the subtitles are located at a fixed position in the decoded video images and relate to the main content of the video signal, and

the video summary generation unit is configured to compare the subtitles located at the fixed position, determine the subtitles at the time points at which the subtitle image changes, and compose, from the decoded video images containing the
determined subtitles, the video summary of the video signal.

24. The apparatus according to claim 23, wherein the video signal is a game-type video signal, and the subtitle located at the fixed position is a scoreboard in the decoded video images that displays the score of the game; and

the video summary generation unit is configured to compare all the scoreboard subtitles, determine the scoreboard subtitles at the time points at which the scoreboard image changes, and compose the video summary of the video signal from the decoded video images containing the determined scoreboard subtitles.

25. The apparatus according to claim 23 or 24, further comprising a video summary output unit configured to output the video summary generated by the video summary generation unit.

26.
An apparatus for generating a video summary of a video signal, comprising:

a video decoding unit configured to decode the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining unit configured to obtain, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation unit configured to process at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that a subtitle having a larger character size and a longer duration relates to more important content of the video signal, and

the video summary generation unit is configured to:

setting a size threshold H for the character size of subtitles and a time threshold T for the duration of subtitles; and

determining the subtitles whose character size is greater than H and whose duration is greater than T, and composing the video summary of the video signal from the decoded video images containing the determined subtitles.

27.
The apparatus according to claim 26, wherein the subtitle characteristic obtaining unit is configured to obtain the character height of the subtitle, as the character size of the subtitle, through the following processing:

processing the image of the subtitle to obtain a binarized image of it;

if the height H of the subtitle image is less than or equal to twice the predefined character height MIN_CS, determining the image height H as the character height of the subtitle, and otherwise proceeding to the next step;

for a horizontal subtitle image, performing a horizontal projection that accumulates the stroke pixel values of the subtitle image in the horizontal direction, the accumulated values being stored in Hist(i), i = 1, 2, ..., H, or, for a vertical subtitle image, performing a vertical projection that accumulates the stroke pixel values of the subtitle image in the vertical direction, the accumulated values being stored in Hist(i), i = 1, 2, ..., H, where i is a positive-integer index denoting a split point;

if Hist(i) <= T, the subtitle is separable at the position of split point i, and the position coordinates i1, i2, ..., ij of all possible separable positions are recorded, the average of (i2-i1), (i3-i2), ..., (ij-i(j-1)) being taken as the character height of the subtitle; when Hist(i) > T, the subtitle is inseparable and the processing proceeds to the next step, where j is a positive integer less than or equal to H, H is the number of split points, and T is a predetermined constant;
performing connected component analysis (CCA) on the image features of the subtitle, and estimating the average height H1 of the character components (CC) of the subtitle;

performing an alignment operation on the character components CC of the subtitle to obtain new character components CC_New, and estimating the average height H2 of the new character components CC_New;

if the area of any one of the new character components CC_New is greater than (α*H*W), or the number of the new character components CC_New is equal to 1, determining the character height of the subtitle as H1, and otherwise determining it as H2, where α is a predetermined constant and W is the width of the subtitle image.

28. The apparatus according to claim 26 or 27, further comprising a video summary output unit configured to output the video summary generated by the video summary generation unit.

29.
An apparatus for generating a video summary of a video signal, comprising:

a video decoding unit configured to decode the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining unit configured to obtain, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation unit configured to process at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that a subtitle having a longer duration and located at a specific position in a decoded video image relates to more important content of the video signal, and

the video summary generation unit is configured to:

setting a time threshold T for the duration of subtitles; and

determining the subtitles whose duration is greater than T and which are located at the specific position in the decoded video images, and composing the video summary of the video signal from the decoded video images containing the determined subtitles.

30.
The apparatus according to claim 29, further comprising a video summary output unit configured to output the video summary generated by the video summary generation unit.

31. An apparatus for generating a video summary of a video signal, comprising:

a video decoding unit configured to decode the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining unit configured to obtain, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation unit configured to process at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that a subtitle having a larger character size and located at a specific position in a decoded video image relates to more important content of the video signal, and

the video summary generation unit is configured to:

setting a size threshold H for the character size of subtitles; and

a subtitle whose character size is greater than H and located at the specific position of the video
decoded image is determined, and the video summary of the video signal is composed from the decoded video images containing the determined subtitle.

32. The apparatus according to claim 31, wherein the subtitle characteristic obtaining unit is configured to obtain the character height of the subtitle, as the character size of the subtitle, through the following processing:

processing the image of the subtitle to obtain a binarized image of it;

if the height H of the subtitle image is less than or equal to twice the predefined character height MIN_CS, determining the image height H as the character height of the subtitle, and otherwise proceeding to the next step;

for a horizontal subtitle image, performing a horizontal projection that accumulates the stroke pixel values of the subtitle image in the horizontal direction, the accumulated values being stored in Hist(i), i = 1, 2, ..., H, or, for a vertical subtitle image, performing a vertical projection that accumulates the stroke pixel values of the subtitle image in the vertical direction, the accumulated values being stored in Hist(i), i = 1, 2, ..., H, where i is a positive-integer index denoting a split point;

if Hist(i) <= T, the subtitle is separable at the position of split point i, and the position coordinates i1, i2, ..., ij of all possible separable positions are recorded, the average of (i2-i1), (i3-i2), ..., (ij-i(j-1)) being taken as the character height of the subtitle; when Hist(i) > T,
it means that the subtitle is inseparable and enters the next step of processing , wherein j is a positive integer less than or equal to H, H is the number of split points, and T is a predetermined constant; 对于字幕的图像特征进行连通域CCA分析,并且估计字幕的字符组件CC高度的平均值为H1;Connected domain CCA analysis is performed on the image features of subtitles, and the average value of CC height of character components of subtitles is estimated to be H1; 对字幕的字符组件CC进行对齐操作,对齐后得到新的字符组件CC_New,估计新的字符组件CC_New的高度的平均值为H2;Perform an alignment operation on the character component CC of the subtitle, and obtain a new character component CC_New after alignment, and estimate the average height of the new character component CC_New to be H2; 如果新的字符组件CC_new中的某个字符组件的面积大于(α*H*W)或者新的字符组件CC_new的数目等于1,则将字幕的字符高度确定为H1,否则将字幕的字符高度确定为H2,其中,α是预定的常数,W是字幕的图像的宽度。If the area of a certain character component in the new character component CC_new is greater than (α*H*W) or the number of the new character component CC_new is equal to 1, then the character height of the subtitle is determined as H1, otherwise the character height of the subtitle is determined is H2, where α is a predetermined constant, and W is the width of the subtitle image. 33.如权利要求31或32所述的装置,还包括视频摘要输出单元,用于将所述视频摘要生成单元生成的视频摘要输出。33. The device according to claim 31 or 32, further comprising a video summary output unit configured to output the video summary generated by the video summary generation unit. 34.一种用于生成视频信号的视频摘要的装置,包括:34. 
An apparatus for generating a video summary of a video signal, comprising:

a video decoding unit configured to decode the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining unit configured to obtain, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation unit configured to process at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that a subtitle located at a specific position in a decoded video image and having a high confidence after optical character recognition relates to more important content of the video signal, and

the video summary generation unit is configured to:

setting a threshold L for the confidence of subtitle images after optical character recognition; and

determining the subtitles located at the specific position in the decoded video images and having a confidence greater than L after optical character recognition of
the subtitle image, and composing the video summary of the video signal from the decoded video images containing the determined subtitles.

35. The apparatus according to claim 34, further comprising a video summary output unit configured to output the video summary generated by the video summary generation unit.

36. An apparatus for generating a video summary of a video signal, comprising:

a video decoding unit configured to decode the video signal to obtain multiple frames of decoded video images;

a subtitle characteristic obtaining unit configured to obtain, for each of the subtitles contained in the multiple frames of decoded video images, at least one of the following characteristics: the duration of the subtitle, the position of the subtitle in the corresponding decoded video image, the character size of the subtitle, and the recognition result and confidence of the subtitle image after optical character recognition (OCR); and

a video summary generation unit configured to process at least one of the obtained subtitle characteristics according to the relationship between the content of the video signal and the characteristics of its subtitles, so as to generate a video summary that matches the content of the video signal,

wherein the relationship between the content of the video signal and the characteristics of the subtitles indicates that a subtitle having a larger character size and a high confidence after optical character recognition relates to more important content of the video signal, and

the video summary generation unit is configured to:
set a size threshold H on the character size of the subtitles, and set a threshold L on the confidence of the subtitle images after optical character recognition; and

determine the subtitles whose character size exceeds H and whose confidence after optical character recognition exceeds L, and compose the video summary of the video signal from the decoded video images containing the determined subtitles.

37. The apparatus according to claim 36, wherein the subtitle characteristic obtaining unit is configured to obtain the character height of a subtitle, as the character size of the subtitle, by the following processing:

processing the image of the subtitle to obtain a binarized image of the subtitle;

if the height H of the subtitle image is less than or equal to twice a predefined character height MIN_CS, determining the image height H as the character height of the subtitle; otherwise proceeding to the next step;

for the image of a horizontal subtitle, performing a horizontal projection that accumulates the stroke pixel values of the subtitle image in the horizontal direction, the accumulated values being stored in Hist(i), i = 1, 2, ..., H; or, for the image of a vertical subtitle, performing a vertical projection that accumulates the stroke pixel values of the subtitle image in the vertical direction, the accumulated values likewise being stored in Hist(i), i = 1, 2, ..., H, where the index i is a positive integer denoting a split point;

if Hist(i) <= T, the subtitle is separable at the position of split point i; recording the position coordinates i1, i2, ..., ij of all possible separable positions and taking the average of (i2-i1), (i3-i2), ..., (ij-i(j-1)) as the character height of the subtitle; and, when Hist(i) > T, the subtitle is not separable and the processing proceeds to the next step, where j is a positive integer less than or equal to H, H is the number of split points, and T is a predetermined constant;

performing connected component analysis (CCA) on the image features of the subtitle, and estimating the average height of the character components CC of the subtitle as H1;

performing an alignment operation on the character components CC of the subtitle to obtain new character components CC_New after alignment, and estimating the average height of the new character components CC_New as H2;

if the area of any component among the new character components CC_New is greater than (α*H*W), or the number of new character components CC_New equals 1, determining the character height of the subtitle as H1; otherwise determining the character height of the subtitle as H2, where α is a predetermined constant and W is the width of the subtitle image.

38. The apparatus according to claim 36 or 37, further comprising a video summary output unit for outputting the video summary generated by the video summary generation unit.
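As a rough illustration of the selection rule claimed for the video summary generation unit (claim 36), the following Python sketch filters subtitles by character size and OCR confidence and collects the frames that contain the survivors. Everything here is hypothetical: the claims fix no data structures, and the `Subtitle` record, the `SIZE_H`/`CONF_L` values, and the function names are illustrative stand-ins for the patent's thresholds H and L.

```python
from dataclasses import dataclass

@dataclass
class Subtitle:
    text: str          # recognition code (the OCR output string)
    char_size: float   # character height in pixels
    confidence: float  # OCR confidence, here normalized to [0, 1]
    frames: range      # decoded frames in which the subtitle appears

SIZE_H = 24.0  # assumed character-size threshold H
CONF_L = 0.8   # assumed OCR-confidence threshold L

def summary_frames(subtitles):
    """Return the frames composing the video summary: every decoded image
    containing a subtitle whose character size exceeds H and whose OCR
    confidence exceeds L (the claim-36 selection rule)."""
    kept = [s for s in subtitles
            if s.char_size > SIZE_H and s.confidence > CONF_L]
    return sorted({f for s in kept for f in s.frames})
```

Under these assumed thresholds, a large high-confidence headline subtitle would place its frames in the summary, while a small ticker or a low-confidence caption would contribute nothing.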
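The character-height procedure of claim 37 can be sketched as below. This is an interpretation, not the patented implementation: the values of MIN_CS, T and α are assumed, the projection path here requires at least two split points before averaging the gaps, and the claim's "alignment operation" on character components is read as merging components whose horizontal extents overlap.

```python
def char_height(img, min_cs=8, t=2, alpha=0.7):
    """Estimate the character height of a binarized horizontal subtitle.

    `img` is a list of pixel rows (stroke = 1, background = 0); min_cs,
    t and alpha stand in for the patent's MIN_CS, T and α, whose values
    the claim leaves open.
    """
    H, W = len(img), len(img[0])
    # Step 1: an image no taller than twice the minimum character height
    # is taken to be a single text line.
    if H <= 2 * min_cs:
        return H
    # Step 2: horizontal projection -- Hist(i) accumulates the stroke
    # pixel values of row i.
    hist = [sum(row) for row in img]
    # Step 3: rows with Hist(i) <= T are separable split points; the
    # character height is the mean gap between consecutive split points.
    splits = [i for i, v in enumerate(hist) if v <= t]
    if len(splits) >= 2:
        gaps = [b - a for a, b in zip(splits, splits[1:])]
        return sum(gaps) / len(gaps)
    # Step 4: connected-component fallback when no split point exists.
    comps = _components(img, H, W)
    if not comps:
        return H
    h1 = sum(c["h"] for c in comps) / len(comps)
    merged = _merge_overlapping(comps)  # CC_New, one reading of "alignment"
    h2 = sum(c["h"] for c in merged) / len(merged)
    oversized = any(c["area"] > alpha * H * W for c in merged)
    return h1 if (oversized or len(merged) == 1) else h2

def _components(img, H, W):
    """4-connected components with bounding-box height and pixel area."""
    seen = [[False] * W for _ in range(H)]
    comps = []
    for y in range(H):
        for x in range(W):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                stack, ys, xs = [(y, x)], [], []
                while stack:
                    cy, cx = stack.pop()
                    ys.append(cy)
                    xs.append(cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < H and 0 <= nx < W
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                comps.append({"x0": min(xs), "x1": max(xs),
                              "y0": min(ys), "y1": max(ys),
                              "h": max(ys) - min(ys) + 1, "area": len(ys)})
    return comps

def _merge_overlapping(comps):
    """Merge components whose horizontal ranges overlap."""
    merged = []
    for c in sorted(comps, key=lambda c: c["x0"]):
        if merged and c["x0"] <= merged[-1]["x1"]:
            m = merged[-1]
            m["x1"] = max(m["x1"], c["x1"])
            m["y0"] = min(m["y0"], c["y0"])
            m["y1"] = max(m["y1"], c["y1"])
            m["h"] = m["y1"] - m["y0"] + 1
            m["area"] += c["area"]
        else:
            merged.append(dict(c))
    return merged
```

For a 30-row image whose only blank rows are single separators between two text lines, the projection path returns the gap between the separators; an image at or below `2 * min_cs` rows is returned at its full height, per the first step of the claim.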
CN200910138455.9A 2009-05-13 2009-05-13 Method and device for generating video abstract and image processing system including device Expired - Fee Related CN101887439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910138455.9A CN101887439B (en) 2009-05-13 2009-05-13 Method and device for generating video abstract and image processing system including device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910138455.9A CN101887439B (en) 2009-05-13 2009-05-13 Method and device for generating video abstract and image processing system including device

Publications (2)

Publication Number Publication Date
CN101887439A CN101887439A (en) 2010-11-17
CN101887439B true CN101887439B (en) 2014-04-02

Family

ID=43073365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910138455.9A Expired - Fee Related CN101887439B (en) 2009-05-13 2009-05-13 Method and device for generating video abstract and image processing system including device

Country Status (1)

Country Link
CN (1) CN101887439B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5716492B2 (en) * 2011-03-30 2015-05-13 ソニー株式会社 Information processing apparatus, information processing method, and program
CN102663015B (en) * 2012-03-21 2015-05-06 上海大学 Video semantic labeling method based on characteristics bag models and supervised learning
CN103646094B (en) * 2013-12-18 2017-05-31 上海紫竹数字创意港有限公司 Realize that audiovisual class product content summary automatically extracts the system and method for generation
US9578358B1 (en) 2014-04-22 2017-02-21 Google Inc. Systems and methods that match search queries to television subtitles
US9535990B2 (en) * 2014-05-20 2017-01-03 Google Inc. Systems and methods for generating video program extracts based on search queries
CN107818784A (en) * 2017-08-15 2018-03-20 上海展扬通信技术有限公司 A kind of title generation method and title generation system of memorandum
CN108259971A (en) * 2018-01-31 2018-07-06 百度在线网络技术(北京)有限公司 Subtitle adding method, device, server and storage medium
CN108495185B (en) * 2018-03-14 2021-04-16 北京奇艺世纪科技有限公司 Video title generation method and device
CN108986186B (en) * 2018-08-14 2023-05-05 山东师范大学 Method and system for converting text into video
CN109344292B (en) * 2018-09-28 2022-04-22 百度在线网络技术(北京)有限公司 Method, device, server and storage medium for generating event score segments
CN109348289B (en) * 2018-11-15 2021-08-24 北京奇艺世纪科技有限公司 News program title extraction method and device
CN109684511A (en) * 2018-12-10 2019-04-26 上海七牛信息技术有限公司 A kind of video clipping method, video aggregation method, apparatus and system
CN111314775B (en) 2018-12-12 2021-09-07 华为终端有限公司 Video splitting method and electronic equipment
CN111836111A (en) 2019-04-17 2020-10-27 微软技术许可有限责任公司 Technique for generating barrage
CN112232260B (en) 2020-10-27 2025-06-13 腾讯科技(深圳)有限公司 Subtitle area recognition method, device, equipment and storage medium
CN114697761B (en) * 2022-04-07 2024-02-13 脸萌有限公司 Processing method, processing device, terminal equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6690725B1 (en) * 1999-06-18 2004-02-10 Telefonaktiebolaget Lm Ericsson (Publ) Method and a system for generating summarized video
CN1836287A (en) * 2003-08-18 2006-09-20 皇家飞利浦电子股份有限公司 Video abstracting


Also Published As

Publication number Publication date
CN101887439A (en) 2010-11-17

Similar Documents

Publication Publication Date Title
CN101887439B (en) Method and device for generating video abstract and image processing system including device
CN113613065B (en) Video editing method, apparatus, electronic device, and storage medium
CN112990191B (en) A Shot Boundary Detection and Key Frame Extraction Method Based on Subtitle Video
JP4643829B2 (en) System and method for analyzing video content using detected text in a video frame
Huang et al. Semantic analysis of soccer video using dynamic Bayesian network
KR100707189B1 (en) An apparatus and method for detecting advertisements of moving images and a computer-readable recording medium storing computer programs for controlling the apparatus.
US8488682B2 (en) System and method for extracting text captions from video and generating video summaries
US20040170392A1 (en) Automatic detection and segmentation of music videos in an audio/video stream
JP4332700B2 (en) Method and apparatus for segmenting and indexing television programs using multimedia cues
EP2034426A1 (en) Moving image analyzing, method and system
CN105183849A (en) Event detection and semantic annotation method for snooker game videos
CN101510260B (en) Apparatus and method for determining subtitle existence time
CN101553814A (en) Method and apparatus for generating a summary of a video data stream
CN110198482A (en) A kind of video emphasis bridge section mask method, terminal and storage medium
CN114495946A (en) Voiceprint clustering method, electronic device and storage medium
Ma et al. Lecture video segmentation and indexing
Raval et al. Shot Segmentation and Replay Detection for Cricket Video Summarization
CN116017088A (en) Video subtitle processing method, device, electronic device and storage medium
CN113807085B (en) Method for extracting title and subtitle aiming at news scene
US20070292027A1 (en) Method, medium, and system extracting text using stroke filters
Petersohn Logical unit and scene detection: a comparative survey
Jung et al. Player information extraction for semantic annotation in golf videos
Assfalg et al. Extracting semantic information from news and sport video
JP2005039354A (en) Metadata input method and editing system
CN116017036B (en) Audio and video analysis method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140402