CN111918145B

CN111918145B - Video segmentation method and video segmentation device

Info

Publication number: CN111918145B
Application number: CN201910376477.2A
Authority: CN
Inventors: 苏芸
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2019-05-07
Filing date: 2019-05-07
Publication date: 2022-09-09
Anticipated expiration: 2039-05-07
Also published as: WO2020224362A1; CN111918145A

Abstract

The application provides a video segmentation method and a video segmentation device. In the method, the video segmentation device segments a video to be processed according to voice information of the video to be processed and at least one of the following: content description information, uploaded in advance, that describes the content of the video to be processed, and a presentation shown in the video to be processed. The technical solution can use information other than the content of the video itself to segment the video to be processed, which can improve the accuracy of the segmentation.

Description

Video segmentation method and video segmentation device

Technical field

The present application relates to the field of information technology, and more particularly, to a video segmentation method and a video segmentation device.

Background

To make a video easier to watch, a complete video can be divided into multiple segments. In this way, the user can go directly to the segment of interest.

A common video segmentation method at present segments a video based on the text information in the video. This text information may be the subtitles in the video, or text obtained by performing speech recognition on the video. In other words, the basis for segmenting a video currently comes from the video itself. In addition, segmentation based on the text information in the video requires all of the text information of the video. The video stream of a live video is generated in real time, so the full text information of the video is available only after the live broadcast ends. Therefore, the above method cannot segment a live video in real time. Furthermore, the above method segments the video only according to its text information, so the segmentation point it determines is not necessarily a suitable one.

Summary of the invention

The present application provides a video segmentation method and a video segmentation device, which can improve the accuracy of video segmentation.

In a first aspect, an embodiment of the present application provides a video segmentation method, including: a video segmentation device obtains text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed; the video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information; and the video segmentation device segments the video to be processed according to the segmentation point. The above technical solution can use information other than the content of the video to be processed itself to segment the video, thereby improving the accuracy of the segmentation.

With reference to the first aspect, in a possible implementation of the first aspect, when the text information includes the presentation, the video segmentation device determining the segmentation point of the video to be processed according to the text information and the voice information includes: determining a switching point of the presentation, where the presentation presents different content before and after the switching point; determining at least one pause point according to the voice information; and determining the segmentation point according to the switching point and the at least one pause point. A switch in the presentation often means that the content of the speaker's talk has changed. Therefore, by taking changes of the presentation into account when dividing the video to be processed into different segments, the above technical solution can determine the segmentation points of the video reasonably and quickly. In addition, when determining the segmentation point, the above technical solution relies only on the switching point of the presentation and the pause points near the switching point. Therefore, the video can be segmented without obtaining the complete video file. In other words, the above technical solution can segment the video to be processed in real time, and can therefore be applied to the segmentation of live video.

With reference to the first aspect, in a possible implementation of the first aspect, determining the segmentation point according to the switching point and the at least one pause point includes: when the switching point is the same as one of the at least one pause point, determining the switching point as the segmentation point; and when none of the at least one pause point is the same as the switching point, determining the pause point closest to the switching point among the at least one pause point as the segmentation point.
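The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `pick_segmentation_point`, the representation of time points as seconds, and the tie-breaking behavior (the earliest-listed pause point wins when two are equally close) are all assumptions.

```python
from typing import Sequence


def pick_segmentation_point(switch_point: float, pause_points: Sequence[float]) -> float:
    """Return the segmentation point for one presentation switch.

    Times are in seconds from the start of the video (an assumption).
    """
    if not pause_points:
        raise ValueError("at least one pause point is required")
    # Case 1: the switching point coincides with a pause point.
    if switch_point in pause_points:
        return switch_point
    # Case 2: otherwise take the pause point closest to the switching point.
    # min() keeps the first of several equally close points (assumed tie-break).
    return min(pause_points, key=lambda p: abs(p - switch_point))
```

For instance, with a switch at 10.0 s and pauses at 4.0 s, 12.5 s, and 15.0 s, the sketch selects the pause at 12.5 s.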

With reference to the first aspect, in a possible implementation of the first aspect, determining the switching point of the presentation includes: determining, as the switching point, the moment at which a switching signal instructing a switch of the content of the presentation is obtained.

With reference to the first aspect, in a possible implementation of the first aspect, the text information further includes the content description information, and before the video segmentation device determines the segmentation point of the video to be processed according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.

With reference to the first aspect, in a possible implementation of the first aspect, when the text information includes the content description information, the video segmentation device determining the segmentation point of the video to be processed according to the text information and the voice information includes: determining the segmentation point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information. The content description information is information that the user enters in advance to describe the video to be processed. It usually includes some key information about the video to be processed, such as keywords and key content. Therefore, the content description information makes it possible to determine more accurately the key content described in different segments of the video to be processed, and thus to segment the video more accurately.

With reference to the first aspect, in a possible implementation of the first aspect, the voice information includes a first voice information segment and a second voice information segment, where the second voice information segment precedes and is adjacent to the first voice information segment. Determining the segmentation point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information includes: determining a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the segmentation points of the video to be processed include the first segmentation point. In addition, when determining the segmentation point of the video to be processed, the above technical solution needs only the keywords of the content description information and the voice information in two adjacent video segments to locate the segmentation point. The video segments can be divided according to a fixed duration and step size. Therefore, video segments can be divided out of the already played video while the video is still playing. In this way, the video can be segmented without obtaining the complete video file. In other words, the above technical solution can segment the video to be processed in real time, and can therefore be applied to the segmentation of live video.
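One way to picture dividing the played portion of a video by a fixed duration and step size is a sliding window. The sketch below is illustrative only: the names `sliding_windows` and `adjacent_pairs`, and the choice to emit only complete windows, are assumptions rather than details given in the text.

```python
from typing import Iterator, List, Tuple

Window = Tuple[float, float]  # (start, end) in seconds


def sliding_windows(elapsed: float, window: float, step: float) -> Iterator[Window]:
    """Yield every complete fixed-length window within the first `elapsed` seconds."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    start = 0.0
    while start + window <= elapsed:
        yield (start, start + window)
        start += step


def adjacent_pairs(elapsed: float, window: float, step: float) -> List[Tuple[Window, Window]]:
    """Pair each window with its successor: (second segment, first segment).

    Following the text's naming, the "second" segment is the earlier one.
    """
    windows = list(sliding_windows(elapsed, window, step))
    return list(zip(windows, windows[1:]))
```

Because the windows depend only on the elapsed playback time, new pairs become available as a live stream progresses, which matches the real-time property described above.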

With reference to the first aspect, in a possible implementation of the first aspect, determining the first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information includes: determining the similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information; determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and determining the first segmentation point according to the pause points in the voice information.
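The passage above does not say how the similarity is computed. As a rough sketch under stated assumptions, the example below builds a vocabulary from the three keyword sources, counts how often each keyword occurs in the content of each segment, and compares the two count vectors with cosine similarity. The cosine measure, the count vectors, the default threshold of 0.5, and all function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def keyword_vector(words: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary keyword occurs in a segment's words."""
    counts = Counter(words)
    return [counts[k] for k in vocabulary]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_changed(first_words: Sequence[str], second_words: Sequence[str],
                  first_keywords: Sequence[str], second_keywords: Sequence[str],
                  description_keywords: Sequence[str],
                  threshold: float = 0.5) -> bool:
    """Return True when the two adjacent segments are less similar than the threshold."""
    vocabulary = sorted(set(first_keywords) | set(second_keywords) | set(description_keywords))
    similarity = cosine(keyword_vector(first_words, vocabulary),
                        keyword_vector(second_words, vocabulary))
    return similarity < threshold
```

When `topic_changed` returns True, the method described above would go on to pick the first segmentation point from the pause points in the voice information.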

With reference to the first aspect, in a possible implementation of the first aspect, the pause points in the voice information include pause points within the first voice information segment or pause points adjacent to the first voice information segment, and determining the first segmentation point according to the pause points in the voice information includes: determining the first segmentation point according to at least one of the number of pause points within the first voice information segment, the number of pause points adjacent to the first voice information segment, the pause durations, and the words adjacent to the pause points.

With reference to the first aspect, in a possible implementation of the first aspect, there are K pause points within the first voice information segment, or there are K pause points adjacent to the first voice information segment. Determining the first segmentation point according to at least one of the number of pause points within the first voice information segment, the number of pause points adjacent to the first voice information segment, the pause durations, and the words adjacent to the pause points includes: when K equals 1, determining that pause point as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words adjacent to the K pause points include exactly one preset word, determining the pause point adjacent to that preset word as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words include at least two preset words, determining, among the at least two pause points adjacent to those preset words, the pause point with the longest pause duration as the segmentation point; and when K is a positive integer greater than or equal to 2 and the K words include no preset word, determining the pause point with the longest pause duration among the K pause points as the segmentation point.
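The four cases for choosing among K pause points can be sketched as follows. The `Pause` data class, its field names, and the example preset words are hypothetical; the text does not list the preset words or say how pauses are represented.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Pause:
    time: float          # position of the pause in seconds (assumed unit)
    duration: float      # length of the pause in seconds
    adjacent_word: str   # the word next to the pause


def choose_pause(pauses: List[Pause], preset_words: Set[str]) -> Pause:
    """Apply the K-pause-point rule to pick the segmentation point."""
    if not pauses:
        raise ValueError("at least one pause point is required")
    # K == 1: the only pause point is the segmentation point.
    if len(pauses) == 1:
        return pauses[0]
    # K >= 2: look at the pauses whose adjacent word is a preset word.
    flagged = [p for p in pauses if p.adjacent_word in preset_words]
    if len(flagged) == 1:
        return flagged[0]
    # Two or more preset words: longest pause among them.
    # No preset word: longest pause among all K.
    candidates = flagged if flagged else pauses
    return max(candidates, key=lambda p: p.duration)
```

A preset word here might be a transition word such as "next"; that choice is purely illustrative.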

With reference to the first aspect, in a possible implementation of the first aspect, the text information further includes the presentation, and before the video segmentation device determines the segmentation point of the video to be processed according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is greater than the first preset duration; or determining that the presentation duration of the current page of the presentation is less than or equal to the second preset duration. The above technical solution can avoid inappropriate segmentation caused by a presentation that stays unchanged for a long time or changes very quickly.
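When both a presentation and content description information are available, the two duration conditions described in this summary act as a switch between the presentation-based method and the content-description-based method. A minimal sketch of that switch follows; the function name, the string labels, and the example values are assumptions.

```python
def choose_strategy(page_duration: float, first_preset: float, second_preset: float) -> str:
    """Pick the segmentation method from the current page's presentation duration.

    All durations are in seconds (an assumed unit).
    """
    if first_preset <= second_preset:
        raise ValueError("the first preset duration must exceed the second")
    # Page shown for a moderate time: trust presentation switches.
    if second_preset < page_duration <= first_preset:
        return "presentation"
    # Page unchanged for too long, or flipped too quickly:
    # fall back to the content description keywords.
    return "content_description"
```

With hypothetical presets of 600 s and 5 s, a page shown for 120 s selects the presentation-based method, while a page shown for 900 s or for 3 s selects the content-description-based method.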

With reference to the first aspect, in a possible implementation of the first aspect, the method further includes: the video segmentation device determines a summary of a segment according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of a target text, where the target text includes at least one of the presentation and the content description information. Based on the above technical solution, a user who reviews the video can use the summary to quickly find the position to review. In addition, the above technical solution takes information other than the video to be processed into account when determining the summary. This can improve both the accuracy of the summary and the speed at which it is determined.

With reference to the first aspect, in a possible implementation of the first aspect, the video segmentation device determining the summary of the segment according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of the target text includes: determining a third keyword vector according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of the target text; and determining the summary of the segment according to the third keyword vector.

With reference to the first aspect, in a possible implementation of the first aspect, the video segmentation device determining the summary of the segment according to the third keyword vector includes: determining a reference text according to the target text and the segment's voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determining J keyword vectors according to the keywords of the segment's voice information, the keywords of the target text, and each of the J sentences; and determining the summary of the segment according to the third keyword vector and the J keyword vectors.

With reference to the first aspect, in a possible implementation of the first aspect, determining the reference text according to the target text and the segment's voice information includes: when the target text includes redundant sentences, deleting the redundant sentences from the target text to obtain a corrected target text, and merging the corrected target text with the segment's voice information to obtain the reference text; and when the target text includes no redundant sentences, merging the target text with the segment's voice information to obtain the reference text.

With reference to the first aspect, in a possible implementation of the first aspect, determining the summary of the segment according to the third keyword vector and the J keyword vectors includes: determining J distances according to the third keyword vector and the J keyword vectors, where the j-th of the J distances is determined according to the third keyword vector and the j-th of the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determining the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determining the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
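The selection of the R sentences closest to the third keyword vector can be sketched as below. The passage above does not name the distance measure, so Euclidean distance is an assumption here, as are the function names and the decision to return the chosen sentences in their original order.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def summarize(sentences: List[str],
              sentence_vectors: List[Sequence[float]],
              segment_vector: Sequence[float],
              r: int) -> List[str]:
    """Return the R sentences whose keyword vectors lie closest to the segment vector."""
    j_count = len(sentences)
    if len(sentence_vectors) != j_count:
        raise ValueError("one keyword vector is needed per sentence")
    if not 1 <= r < j_count:
        raise ValueError("R must satisfy 1 <= R < J")
    # J distances: one per sentence keyword vector.
    distances = [euclidean(segment_vector, v) for v in sentence_vectors]
    # Indices of the R shortest distances (stable for ties).
    nearest = sorted(range(j_count), key=lambda j: distances[j])[:r]
    # Keep the original sentence order in the summary (an assumed design choice).
    return [sentences[j] for j in sorted(nearest)]
```

Here `segment_vector` plays the role of the third keyword vector and `sentence_vectors` the role of the J keyword vectors.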

With reference to the first aspect, in a possible implementation of the first aspect, the video to be processed is a real-time video stream, and the voice information of the video to be processed is the voice information of the real-time video stream from the start of the stream, or from the previous segmentation point, up to the current moment. The above technical solution can segment a video in real time. In other words, segmenting a video with the above technical solution does not require the entire content of the video to be processed. Therefore, the above technical solution can segment a live video in real time.

In a second aspect, an embodiment of the present application provides a video segmentation device, where the device includes units for performing the method of the first aspect or of any possible implementation of the first aspect.

Optionally, the video segmentation device of the second aspect may be a computer device, or may be a component (for example, a chip or a circuit) that can be used in a computer device.

In a third aspect, an embodiment of the present application provides a storage medium, where the storage medium stores instructions for implementing the method described in the first aspect or in any possible implementation of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer program product containing instructions which, when the computer program product runs on a computer, cause the computer to perform the method described in the first aspect or in any possible implementation of the first aspect.

Description of drawings

FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by an embodiment of the present application can be applied;

FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by an embodiment of the present application can be applied;

FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a video conference process provided according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application;

FIG. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.

Detailed description

The technical solutions in the present application are described below with reference to the accompanying drawings.

In this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean that A exists alone, that A and B both exist, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b, or c" can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, in the embodiments of the present application, words such as "first" and "second" do not limit quantity or order of execution.

It should be noted that, in this application, words such as "exemplary" or "for example" are used to indicate an example, an illustration, or an explanation. Any embodiment or design described in this application as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the related concepts in a concrete manner.

Various aspects or features of the present application may be implemented as methods, apparatuses, or articles of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used in this application covers a computer program accessible from any computer-readable device, carrier, or medium. For example, computer-readable media may include, but are not limited to: magnetic storage devices (for example, hard disks, floppy disks, or magnetic tapes), optical discs (for example, compact discs (CDs) and digital versatile discs (DVDs)), smart cards, and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, or key drives). In addition, the various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.

FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by this application can be applied. FIG. 1 shows a video conference system that includes a conference control server 101, a conference terminal 111, a conference terminal 112, and a conference terminal 113. The conference terminal 111, the conference terminal 112, and the conference terminal 113 can establish a conference through the conference control server 101.

A video conference usually includes at least two conference sites. Each site can access the conference control server through a conference terminal. The conference terminal may be a device for accessing a video conference. It can receive conference data and present the conference content on a display device according to the conference data. The conference terminal may include a host and a display device. The host can receive conference data through a communication interface, generate a video signal according to the received conference data, and output the video signal to the display device in a wired or wireless manner. The display device presents the conference content according to the received video signal. Optionally, in some embodiments, the display device may be built into the host. For example, the conference terminal may be an electronic device with a built-in display device, such as a notebook computer, a tablet computer, or a smartphone. Optionally, in other embodiments, the display device may be external to the host. For example, the host may be a computer host, and the display device may be a monitor, a television, or a projector. As another example, even if the host has a built-in display device, the display device used to present the conference content may still be external to the host. For example, the host may be a notebook computer, and the display device may be a monitor, a television, or a projector connected to the notebook computer.

In some cases, a video conference may include one main site and at least one branch site. In this case, the conference terminal in the main site (for example, the conference terminal 111) can upload the collected media stream of the main site to the conference control server 101. The conference control server 101 can generate conference data according to the received media stream and send the conference data to the conference terminals in the branch sites (for example, the conference terminal 112 and the conference terminal 113). The conference terminal 112 and the conference terminal 113 can present the conference content on their display devices according to the received conference data.

In other cases, the at least two sites in a video conference may not be divided into main and branch sites. The conference terminal in each site can upload its collected media stream to the conference control server 101. For example, assume that the conference terminal 111 is the conference terminal used to access the video conference in site 1, the conference terminal 112 is the conference terminal used to access the video conference in site 2, and the conference terminal 113 is the conference terminal used to access the video conference in site 3. The conference terminal 111 can upload the collected media stream of site 1 to the conference control server 101; the conference control server 101 can generate conference data 1 according to the media stream of site 1 and send the conference data 1 to the conference terminal 112 and the conference terminal 113; and the conference terminal 112 and the conference terminal 113 can present the conference content on their display devices according to the received conference data 1. Similarly, the conference terminal 112 can upload the collected media stream of site 2 to the conference control server 101; the conference control server 101 can generate conference data 2 according to the media stream of site 2 and send the conference data 2 to the conference terminal 111 and the conference terminal 113; and the conference terminal 111 and the conference terminal 113 can present the conference content on their display devices according to the received conference data 2. Likewise, the conference terminal 113 can upload the collected media stream of site 3 to the conference control server 101; the conference control server 101 can generate conference data 3 according to the media stream of site 3 and send the conference data 3 to the conference terminal 111 and the conference terminal 112; and the conference terminal 111 and the conference terminal 112 can present the conference content on their display devices according to the received conference data 3.

Optionally, in some embodiments, the media stream may be an audio stream. Optionally, in other embodiments, the media stream may be a video stream. The media device responsible for collecting the media stream may be built into the conference terminal (for example, a camera and a microphone in the conference terminal) or may be externally connected to the conference terminal, which is not limited in the embodiments of the present application.

Optionally, in some embodiments, the speaker at the conference uses a presentation while speaking. In this case, the media stream may be the audio stream of the speaker's speech. The presentation used by the speaker can be uploaded to the conference control server 101 through an auxiliary stream (also called a data stream or a computer screen stream). The conference control server 101 generates conference data from the received audio stream and auxiliary stream. Optionally, in some possible implementations, the conference data may include the received audio stream and the auxiliary stream. Optionally, in other possible implementations, the conference data may include the auxiliary stream and a processed audio stream obtained by processing the received audio stream. Processing the received audio stream may mean transcoding it, for example reducing its bit rate so as to reduce the amount of data needed to transmit the audio stream to other conference terminals. Optionally, in other possible implementations, the conference data may include the received audio stream, an audio stream whose bit rate differs from that of the received audio stream, and the auxiliary stream. In this way, a conference terminal can select a suitable audio stream according to its network conditions and/or the way it accesses the conference. For example, if the network condition of the conference terminal is good or it accesses the conference over Wi-Fi, it can select an audio stream with a higher bit rate and thus receive clearer sound. As another example, if the network condition of the conference terminal is poor, it can select an audio stream with a lower bit rate, which reduces interruptions of the live conference caused by poor network conditions. As another example, if the conference terminal accesses the conference over a mobile network, it can select an audio stream with a lower bit rate, which reduces data traffic consumption. Optionally, in other possible implementations, in addition to audio streams of at least one bit rate and the auxiliary stream, the conference data may also include subtitles corresponding to the speaker's speech. The subtitles may be generated by speech-to-text conversion of the speaker's speech based on speech recognition technology, may be a manual record of the speaker's speech, or may be generated by speech-to-text conversion combined with manual correction.

Optionally, in other embodiments, the media stream may be a video stream of the speaker during the speech. In other words, the media stream may include both the sound information and the picture information of the speaker during the speech. Correspondingly, the media stream uploaded to the conference control server 101 is the video stream. In some cases, assume that the speaker uses a presentation while speaking and shows it on an output device (for example, a projector or a television). The picture information in the media stream then includes the presentation shown by the speaker, so the video stream uploaded to the conference control server 101 includes the presentation. In this case, the conference control server 101 can determine the conference data directly from the video stream. In other cases, the presentation used by the speaker can be uploaded to the conference control server 101 as an auxiliary stream. The conference control server 101 can generate conference data from the collected video stream and the auxiliary stream. Optionally, in some possible implementations, the conference data may include the collected video stream and the auxiliary stream. Optionally, in other possible implementations, the conference data may include the auxiliary stream and a processed video stream obtained by processing the collected video stream. Processing the collected video stream may mean transcoding it, for example reducing its resolution so as to reduce the amount of data needed to transmit the video stream to other conference terminals. Optionally, in other possible implementations, the conference data may include the collected video stream, a video stream whose resolution differs from that of the collected video stream, and the auxiliary stream. In this way, a conference terminal can select a suitable video stream according to its network conditions and/or the way it accesses the conference. For example, if the network condition of the conference terminal is good or it accesses the conference over Wi-Fi, it can select a video stream with a higher resolution, so that the audience sees a clearer picture. As another example, if the network condition of the conference terminal is poor, it can select a video stream with a lower resolution, which reduces interruptions of the live conference caused by poor network conditions. As another example, if the conference terminal accesses the conference over a mobile network, it can select a video stream with a lower resolution, which reduces data traffic consumption. Optionally, in other possible implementations, in addition to video streams of at least one resolution and the auxiliary stream, the conference data may also include subtitles corresponding to the speaker's speech. The subtitles may be generated by speech-to-text conversion of the speaker's speech based on speech recognition technology, may be a manual record of the speaker's speech, or may be generated by speech-to-text conversion combined with manual correction.
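The stream selection rule described for audio bit rates and video resolutions can be illustrated with a minimal Python sketch. The names `StreamVariant` and `select_stream` are hypothetical and do not come from the application. The sketch also assumes that a poor network or mobile access takes priority over Wi-Fi when the conditions conflict, which the description does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamVariant:
    """One offered version of a stream (hypothetical structure)."""
    label: str
    quality: int  # bit rate for audio, or resolution for video; higher is better


def select_stream(variants, network_good, access):
    """Pick a variant for a terminal.

    Assumption: a poor network or mobile access selects the lowest-quality
    variant; otherwise the highest-quality variant is selected.
    """
    if not variants:
        raise ValueError("no stream variants offered")
    ordered = sorted(variants, key=lambda v: v.quality)
    if access == "mobile" or not network_good:
        return ordered[0]   # lower bit rate / resolution
    return ordered[-1]      # higher bit rate / resolution


offered = [StreamVariant("720p", 720), StreamVariant("1080p", 1080)]
print(select_stream(offered, network_good=True, access="wifi").label)
```

A terminal on a good Wi-Fi connection receives the higher-quality variant here, while a terminal on a mobile network or a poor connection receives the lower-quality one.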

FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by the present application can be applied. FIG. 2 shows a distance education system, which includes a course server 201, a main device 211, a client device 212 and a client device 213.

The main device 211 can upload the collected media stream to the course server 201. The course server 201 can generate course data from the media stream and send the course data to the client device 212 and the client device 213, which can present the course content on a display device according to the received course data.

The main device 211 may be a notebook computer or a desktop computer. The client device 212 and the client device 213 may be notebook computers, desktop computers, tablet computers, smartphones, and the like.

Optionally, in some embodiments, the teacher giving the lecture uses a presentation during the lecture. In this case, the media stream may be the audio stream of the teacher's lecture. The presentation used by the teacher during the lecture can be uploaded to the course server 201 through an auxiliary stream. The course server 201 generates course data from the received audio stream and auxiliary stream.

Optionally, in other embodiments, the media stream may be a video stream of the teacher during the lecture. In other words, the media stream may include both the sound information and the picture information of the teacher during the lecture. Correspondingly, the media stream uploaded to the course server 201 is the video stream. In some cases, assume that the teacher uses a presentation during the lecture and shows it on an output device (for example, a projector or a television). The picture information in the media stream then includes the presentation shown by the teacher, so the video stream uploaded to the course server 201 includes the presentation. In this case, the course server 201 can determine the course data directly from the video stream. In other cases, the presentation used by the teacher during the lecture can be uploaded to the course server 201 as an auxiliary stream. The course server 201 can generate course data from the collected video stream and the auxiliary stream.

The specific content of the course data is similar to that of the conference data and, for brevity, is not repeated here.

FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application. The method shown in FIG. 3 may be performed by a video segmentation apparatus. The video segmentation apparatus may be a computer device capable of implementing the method provided by the embodiments of the present application, such as a personal computer, a notebook computer, a tablet computer or a server. It may also be hardware inside a computer device that is capable of implementing the method, such as a graphics card or a graphics processing unit (GPU), or it may be a dedicated apparatus for implementing the method. For example, in some embodiments, the video segmentation apparatus may be the conference control server 101 in the system shown in FIG. 1, or a piece of hardware in the conference control server 101. As another example, in other embodiments, the video segmentation apparatus may be the conference terminal that uploads the media stream in the system shown in FIG. 1, or a piece of hardware in that conference terminal. As another example, in other embodiments, the video segmentation apparatus may be the main device 211 in the system shown in FIG. 2, or a piece of hardware in the main device 211. As another example, in other embodiments, the video segmentation apparatus may be the course server 201 in the system shown in FIG. 2, or a piece of hardware in the course server 201.

For ease of description, it is assumed that the method shown in FIG. 3 is applied in the system shown in FIG. 1.

301. The video segmentation apparatus acquires text information of the video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed.

The presentation is the document that the speaker at the conference shows while speaking. The embodiments of the present application do not limit the file format of the presentation: any document shown on a display device during the speaker's speech can be the presentation. For example, the presentation may be a document in ppt or pptx format. As another example, the presentation may be a document in PDF format. As another example, the presentation may be a document in word or txt format.

The content description information is information that the speaker or the moderator of the conference uploads before the conference starts in order to describe the content of the speech. Optionally, in some embodiments, the content description information includes an outline, an abstract and/or key information of the speaker's speech in the video conference. For example, the content description information may include keywords of the speaker's speech. As another example, the content description information may include an abstract of the speaker's speech. As another example, the speaker's speech may include multiple parts, and the content description information may include the subject, abstract and/or keywords of each of those parts.

The voice information may include the text obtained by speech-to-text conversion of the speaker's speech. The embodiments of the present application do not limit the specific implementation of the speech-to-text conversion, as long as the recognized speech can be converted into the corresponding text. The voice information may further include at least one pause point obtained by performing speech recognition on the speaker's speech. A pause point represents a natural pause of the speaker during the speech.

302. The video segmentation apparatus determines a segmentation point of the video to be processed according to the text information and the voice information.

As described above, the text information may include at least one of the presentation and the content description information. In other words, the text information falls into one of the following three cases:

Case 1: The text information includes only the presentation;

Case 2: The text information includes only the content description information;

Case 3: The text information includes both the presentation and the content description information.

In other words, in some cases the speaker may only show the presentation while speaking, without uploading the content description information in advance, so Case 1 may occur. In other cases, the speaker may only upload the content description information in advance, without showing a presentation while speaking, so Case 2 may occur. In still other cases, the speaker may both show the presentation while speaking and upload the content description information in advance, so Case 3 may occur.

For Case 1, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the presentation.

For Case 2, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information.

Optionally, in some embodiments, for Case 3, the video segmentation apparatus may determine the segmentation point of the video to be processed according to one of the presentation and the content description information. In other words, when the text information includes both the presentation and the content description information, the video segmentation apparatus may determine the segmentation point of the video to be processed according to either the presentation or the content description information.

Optionally, in some embodiments, when the text information includes the presentation and the content description information, the video segmentation apparatus may determine the presentation duration of the current page of the presentation and, according to that duration, decide whether to determine the segmentation point of the video to be processed according to the presentation or according to the content description information.

Optionally, in some embodiments, when the presentation duration of the current page of the presentation is greater than a first preset duration, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information. This avoids a segment of the video becoming too long because the speaker shows the same content for a long time. The first preset duration can be set as required. For example, the first preset duration may be 10 minutes. As another example, the first preset duration may be 15 minutes.

Optionally, in some embodiments, when the presentation duration of the current page of the presentation is less than or equal to a second preset duration, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information. This avoids a segment of the video becoming too short because the speaker frequently switches the displayed content of the presentation. Like the first preset duration, the second preset duration can be set as required. For example, the second preset duration may be 20 seconds. As another example, the second preset duration may be 10 seconds.

The first preset duration is greater than the second preset duration.

Optionally, in some embodiments, when the presentation duration of the current page of the presentation is greater than the second preset duration and less than or equal to the first preset duration, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the presentation and the voice information.

Optionally, in other embodiments, only the first preset duration may be set. If the presentation duration of the current page of the presentation is greater than the first preset duration, the segmentation point of the video to be processed is determined according to the content description information and the voice information. If the presentation duration of the current page is not greater than the first preset duration, the segmentation point of the video to be processed may be determined according to the presentation and the voice information. The presentation duration of the current page of the presentation is the length of time the presentation stays on the current page.

Optionally, in some embodiments, the start of the presentation duration of the current page is the moment at which the presentation switches to the current page, and the end of the presentation duration of the current page is the moment at which the presentation switches from the current page to another page.

For example, if the presentation switches to page n at time T₁ (n is a positive integer greater than or equal to 1), the video segmentation apparatus can start timing from T₁. If the timed duration exceeds the first preset duration and the presentation has still not switched to page n+1, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information. If the presentation switches to page n+1 at time T₂ (T₂ is later than T₁) and the duration from T₁ to T₂ is less than or equal to the second preset duration, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information. If the duration from T₁ to T₂ is less than or equal to the first preset duration and greater than the second preset duration, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the presentation and the voice information. More specifically, the video segmentation apparatus may determine the segmentation point according to page n of the presentation and the voice information.
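The threshold rule above can be summarised in a short Python sketch. The function name `choose_text_source` and its return labels are hypothetical; the default values of 600 seconds and 20 seconds are taken from the example durations mentioned in the description (10 minutes and 20 seconds).

```python
def choose_text_source(page_duration_s, first_preset_s=600.0, second_preset_s=20.0):
    """Decide which text information is combined with the voice information.

    page_duration_s is the time the presentation has stayed on the current
    page, in seconds. Returns "content_description" or "presentation".
    """
    if first_preset_s <= second_preset_s:
        raise ValueError("first preset duration must exceed second preset duration")
    # Page shown too long: avoids an overly long segment.
    if page_duration_s > first_preset_s:
        return "content_description"
    # Page flipped too quickly: avoids an overly short segment.
    if page_duration_s <= second_preset_s:
        return "content_description"
    # Otherwise the current page of the presentation is used.
    return "presentation"


for duration in (5.0, 120.0, 900.0):
    print(duration, choose_text_source(duration))
```

With these defaults, a page shown for 120 seconds leads to the presentation being used, while pages shown for 5 seconds or 900 seconds lead to the content description information being used.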

Optionally, in other embodiments, the start of the presentation duration of the current page may be the previous segmentation point, and the end of the presentation duration of the current page is the moment at which the presentation switches from the current page to another page.

For example, assume that the presentation switches to page n at time T₃ (n is a positive integer greater than or equal to 1) and that the presentation stays on page n for longer than the first preset duration. In this case, the video segmentation apparatus determines, according to the content description information and the voice information, that one segmentation point of the video to be processed is time T₄. The video segmentation apparatus can then start timing from T₄. If the timed duration exceeds the first preset duration and the presentation has still not switched to page n+1, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information. If the presentation switches to page n+1 at time T₅ (T₅ is later than T₄) and the duration from T₄ to T₅ is not greater than the first preset duration and is greater than the second preset duration, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the presentation and the voice information. More specifically, the video segmentation apparatus may determine the segmentation point according to page n of the presentation and the voice information.

Optionally, in other embodiments, when the text information includes the presentation and the content description information, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the presentation and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation apparatus may refer only to the presentation and the voice information (that is, without using the content description information) to determine the segmentation point of the video to be processed.

Optionally, in other embodiments, when the text information includes the presentation and the content description information, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation apparatus may refer only to the content description information and the voice information (that is, without using the presentation) to determine the segmentation point of the video to be processed.

Determining the segmentation point of the video to be processed according to the presentation and the voice information may include the following: the video segmentation apparatus determines a switching point of the presentation, where the content shown by the presentation before the switching point differs from the content shown after it; the video segmentation apparatus determines at least one pause point according to the voice information; and the video segmentation apparatus determines the segmentation point according to the switching point and the at least one pause point.
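The following Python sketch shows one possible way to combine a switching point with pause points. This passage does not state how the two are combined, so choosing the pause point nearest in time to the switching point is an assumption made purely for illustration, as is the fallback to the switching point itself when no pause point is available. The function name `segmentation_point` is hypothetical.

```python
def segmentation_point(switch_time_s, pause_points_s):
    """Return a candidate segmentation point, in seconds.

    Assumption (not stated in this passage): the pause point closest to the
    switching point is used, so that the cut falls on a natural pause.
    """
    if not pause_points_s:
        # Assumed fallback: no pause detected, cut at the switch itself.
        return switch_time_s
    return min(pause_points_s, key=lambda p: abs(p - switch_time_s))


print(segmentation_point(100.0, [90.0, 98.5, 104.0]))
```

Under this assumption the cut lands on a natural pause in the speech rather than in the middle of a sentence.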

The switching point of the presentation is the moment at which the presentation switches. A switch may be a page turn, for example from page 1 to page 2. A switch may also be a change in the content of the presentation without a page turn. For example, when the presentation is a text document, the speaker may show only part of a page (for example, the upper half) and then scroll to the rest of that page (for example, the lower half). Although no page is turned, the content shown by the presentation has changed.

Optionally, in some embodiments, the video segmentation apparatus may acquire a switching signal that instructs the content of the presentation to be switched. In this case, the video segmentation apparatus may determine that the moment at which the switching signal is acquired is the switching point.

Optionally, in some embodiments, the video segmentation apparatus may acquire the content of the presentation. In this case, the video segmentation apparatus may determine the switching point according to changes in the content of the presentation. For example, when the video segmentation apparatus determines that the presentation content shown in the video to be processed at a first moment differs from the presentation content shown at a second moment, it may determine that the first moment is the switching point. Optionally, in some embodiments, the first moment and the second moment are adjacent moments, and the first moment precedes the second moment. Optionally, in other embodiments, the first moment precedes the second moment and the interval between them is less than a preset duration. In other words, in this case the video segmentation apparatus can check at regular intervals whether the content shown by the presentation has changed.
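The periodic check described above can be sketched in Python as follows. The function name `find_switch_points` and the representation of each sample as a `(timestamp, content)` pair are assumptions for illustration; any comparable representation of the displayed content would work.

```python
def find_switch_points(samples):
    """Return the timestamps taken as switching points.

    samples is a list of (timestamp, content) pairs in increasing time order,
    taken at regular intervals. When the content at one sample (the "first
    moment") differs from the content at the next sample (the "second
    moment"), the first moment is reported as the switching point.
    """
    switches = []
    for (t_first, c_first), (_t_second, c_second) in zip(samples, samples[1:]):
        if c_first != c_second:
            switches.append(t_first)
    return switches


samples = [(0, "page 1"), (5, "page 1"), (10, "page 2"), (15, "page 2"), (20, "page 3")]
print(find_switch_points(samples))
```

A shorter sampling interval locates the switch more precisely at the cost of more comparisons.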

Optionally, in some embodiments, the video segmentation apparatus may determine the switching point by combining the acquired switching signal, which instructs the presentation to switch content, with the content shown by the presentation. For example, the apparatus acquires the switching signal at time T₁. The apparatus may obtain the content shown by the presentation in the F₁ frames before time T₁ and in the F₂ frames after time T₁, where F₁ and F₂ are positive integers greater than or equal to 1. Optionally, in some embodiments, F₁ and F₂ may take small values, for example 2, which reduces the amount of computation. If the content shown in two consecutive frames among the F₁ frames and the F₂ frames differs, the moment of the frame at which the content changes may be determined to be the switching point. For example, suppose F₁ and F₂ both equal 2. If the content shown in the second and third of the four frames differs, the moment of the second frame may be determined to be the switching point. Determining the switching point from both the switching signal and the content shown by the presentation avoids an inaccurate switching point caused by the switching signal being out of sync with the on-screen change.
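The combination of switching signal and frame content described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `refine_switch_point`, the representation of frames as `(timestamp, content)` pairs, the treatment of a frame exactly at T₁ as a "before" frame, and the fallback to T₁ when no change is found are all assumptions made for the example.

```python
def refine_switch_point(frames, t1, f1=2, f2=2, differs=lambda a, b: a != b):
    """Return the timestamp of the frame at which the shown content changes.

    frames  -- list of (timestamp, content) pairs in chronological order
    t1      -- time at which the switching signal was acquired
    f1, f2  -- number of frames inspected before and after t1
    differs -- hypothetical comparison of two frame contents
    """
    before = [f for f in frames if f[0] <= t1][-f1:]  # last f1 frames up to t1
    after = [f for f in frames if f[0] > t1][:f2]     # first f2 frames after t1
    window = before + after
    # Compare each pair of consecutive frames in the window.
    for (ts_a, content_a), (_, content_b) in zip(window, window[1:]):
        if differs(content_a, content_b):
            return ts_a  # moment of the earlier frame of the differing pair
    return t1  # assumption: fall back to the signal time if nothing changed


frames = [(0.0, "page1"), (0.1, "page1"), (0.2, "page2"), (0.3, "page2")]
switch_time = refine_switch_point(frames, t1=0.15)
```

With F₁ = F₂ = 2, the second and third of the four inspected frames differ here, so the sketch returns the timestamp of the second frame, matching the example in the text.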

Optionally, in some embodiments, the video segmentation apparatus may determine whether the content shown by the presentation at different moments (or in different frames) is the same as follows: the apparatus counts the number P of positions at which the pixel value changes by more than a preset change value between the two moments (or frames). If P is greater than a first preset threshold P₁, the apparatus determines that the content shown by the presentation has changed. Optionally, in some embodiments, the change in a pixel value may be computed as the absolute value of the difference in grayscale value. Optionally, in other embodiments, it may be computed as the sum of the absolute values of the differences in the three color channels.
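The pixel comparison above can be illustrated with a short sketch. It assumes frames are given as nested lists of equal size whose entries are either grayscale integers or (R, G, B) tuples; the function names and the sample threshold values are hypothetical.

```python
def pixel_change(a, b):
    """Change between two pixels: grayscale ints or (R, G, B) tuples."""
    if isinstance(a, (tuple, list)):
        # Sum of absolute differences over the color channels.
        return sum(abs(x - y) for x, y in zip(a, b))
    # Absolute difference of grayscale values.
    return abs(a - b)


def count_changed_pixels(frame_a, frame_b, change_value):
    """Return P, the number of positions whose change exceeds change_value."""
    return sum(
        1
        for row_a, row_b in zip(frame_a, frame_b)
        for pa, pb in zip(row_a, row_b)
        if pixel_change(pa, pb) > change_value
    )


def content_changed(frame_a, frame_b, change_value, p1):
    """The content is considered changed when P is greater than P1."""
    return count_changed_pixels(frame_a, frame_b, change_value) > p1


gray_a = [[0, 0], [0, 0]]
gray_b = [[0, 50], [200, 5]]
p = count_changed_pixels(gray_a, gray_b, change_value=10)
```

In this toy example two of the four positions change by more than 10, so P is 2 and the content counts as changed for any P₁ below 2.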

Optionally, in some embodiments, if P is greater than a second preset threshold P₂ (where P₂ is less than P₁), the video segmentation apparatus may determine keywords from the later presentation content. For example, the apparatus determines that, between the presentation at time T₁ and the presentation at time T₂ (T₂ being later than T₁), the number of positions at which the pixel value changes by more than the preset change value is greater than P₂ and less than P₁. In this case, the apparatus may determine keywords from the presentation at time T₂.

As described above, the voice information may further include at least one pause point. Optionally, in some embodiments, the at least one pause point used to determine the segmentation point may be all pause points from a start time to the current time. If the segmentation point determined in step 302 is the first segmentation point of the video to be processed, the start time is the start time of the video to be processed. If the segmentation point determined in step 302 is the k-th segmentation point of the video to be processed (k being a positive integer greater than or equal to 2), the start time is the moment of the (k-1)-th segmentation point. Optionally, in other embodiments, the video segmentation apparatus may instead determine the pause points within a time range that is chosen according to the moment of the switching point and that contains the switching point. For example, if the switching point is at time T₁, the apparatus may determine the pause points between time T₁-t and time T₁+t.

When the video segmentation apparatus determines that the switching point coincides with one of the at least one pause point, it determines the switching point to be the segmentation point. When the apparatus determines that the switching point coincides with none of the at least one pause point, it determines the pause point closest to the switching point to be the segmentation point. The distance between a pause point and the switching point is the time difference between them. For example, suppose the switching point is at time T₁, one of the pause points is at time T₂, and the difference between T₂ and T₁ is t. If every other pause point differs from T₁ by more than t, the pause point at time T₂ is the segmentation point. If two pause points are equally distant from the switching point and closer to it than all other pause points, either of the two may be determined to be the segmentation point.
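The rule for aligning a switching point with the pause points can be sketched as below. The function name `choose_segment_point` and the representation of pause points as a non-empty list of timestamps are assumptions for illustration.

```python
def choose_segment_point(switch_time, pause_times):
    """Pick the segmentation point for a switching point.

    pause_times must be a non-empty list of pause-point timestamps.
    """
    # The switching point coincides with a pause point: use it directly.
    if switch_time in pause_times:
        return switch_time
    # Otherwise use the pause point with the smallest time difference.
    # On a tie, min() returns the first candidate, which is permitted
    # because either of the two equally distant pause points may be chosen.
    return min(pause_times, key=lambda p: abs(p - switch_time))


point = choose_segment_point(10.0, [4.0, 9.0, 12.5])
```

Here the pause point at 9.0 is 1.0 away from the switching point, closer than 4.0 or 12.5, so it is the one selected.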

That the video segmentation apparatus determines the segmentation point of the video to be processed according to the content description information and the voice information may include: the apparatus determines the segmentation point according to the voice information, the keywords of the content description information, and the pause points in the voice information.

Optionally, in some possible implementations, the video to be processed may be divided into multiple voice information fragments. A first voice information fragment and a second voice information fragment are two consecutive fragments among them, and the first voice information fragment follows the second voice information fragment. The video segmentation apparatus may determine a first segmentation point according to the first voice information fragment, the second voice information fragment, the keywords of the content description information, and the pause points in the voice information. The first segmentation point is one of the at least one segmentation point of the video to be processed.

The video segmentation apparatus may extract text fields from the voice information using a window of length W and a step size S. The apparatus may extract at least one text field of length W. Each text field of length W is one voice information fragment.
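A minimal sketch of this windowing follows. The text does not state the unit of W and S, so measuring them in words (tokens) of the recognized speech is an assumption, as is the function name `sliding_windows`.

```python
def sliding_windows(words, window, step):
    """Cut fragments of `window` words from `words`, advancing by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # Only full-length windows are produced; a trailing partial window
    # is dropped in this sketch.
    return [
        words[i:i + window]
        for i in range(0, len(words) - window + 1, step)
    ]


fragments = sliding_windows(["a", "b", "c", "d", "e", "f"], window=4, step=2)
```

With six words, W = 4 and S = 2, the sketch yields two overlapping fragments, starting at the first and third word.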

The video segmentation apparatus may determine whether the first voice information fragment is similar to the second voice information fragment. If they are not similar, the apparatus may determine that a segmentation point of the video to be processed lies near the first voice information fragment. If they are similar, the apparatus goes on to determine whether a third voice information fragment, which is adjacent to and follows the first voice information fragment, is similar to the first voice information fragment.

A similarity may be used as the criterion for deciding whether the first voice information fragment is similar to the second voice information fragment. If the similarity between the two fragments is greater than or equal to a similarity threshold, the fragments may be considered similar; if the similarity is less than the similarity threshold, the fragments may be considered not similar.

Optionally, in some possible implementations, the video segmentation apparatus may determine the similarity between the first voice information fragment and the second voice information fragment according to the keywords of the first voice information fragment, the keywords of the second voice information fragment, the content of the first voice information fragment, the content of the second voice information fragment, and the keywords of the content description information.

The video segmentation apparatus may determine the keywords of the first voice information fragment. Assume that N keywords are determined from the first voice information fragment and M keywords from the content description information, and that no keyword appears in both the M keywords and the N keywords.

The video segmentation apparatus may determine keywords as follows:

Step 1: According to a preset stop word list or the part of speech of each word in the text, remove words that carry no actual meaning, such as "of", "this", and "then". Stop words are words or characters that are entered manually rather than generated automatically. They carry no actual meaning and are filtered out before or after natural language data is processed. The set of stop words may be called a stop word list.

Step 2: Count how frequently each of the remaining words occurs in the text. The frequency of each word may be determined according to the following formula:

TF(n) = N(n) / All_N, Equation 1.1

where TF(n) is the frequency with which the n-th of the words remaining after Step 1 occurs in the text, N(n) is the number of occurrences of the n-th word, and All_N is the total number of remaining words.

Step 3: Determine the at least one most frequent word to be the keywords of the text.

For example, if the text is the content description information, the M most frequent words may be determined as the keywords of the content description information, where M is a positive integer greater than or equal to 1. If the text is the first voice information fragment, the N most frequent words may be determined as its keywords, where N is a positive integer greater than or equal to 1. If one or more of those N most frequent words are the same as keywords of the content description information, the duplicated words are dropped and the next most frequent words are selected as keywords of the first voice information fragment instead. For example, suppose N equals 2, M equals 1, and the keyword of the content description information is "student". If the most frequent word in the first voice information fragment is "student", the apparatus goes on to the second most frequent word. If that word is "school", then "school" is one keyword of the first voice information fragment, and the apparatus goes on to the third most frequent word. If that word is "course", then "course" is the other keyword of the first voice information fragment. The keywords of the second voice information fragment are determined in the same way: the N most frequent words are taken, and any word that duplicates a keyword of the content description information is replaced by the next most frequent word.
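Steps 1 to 3 and the de-duplication against the content description keywords can be sketched as follows. The function name `top_keywords`, the tokenized input, the sample stop word, and the alphabetical tie-break between equally frequent words are assumptions made for this example.

```python
from collections import Counter


def top_keywords(words, k, stop_words=frozenset(), exclude=frozenset()):
    """Return up to k (word, frequency) pairs for the most frequent words.

    words      -- tokenized text
    stop_words -- words removed in Step 1
    exclude    -- keywords already chosen elsewhere (for example, from the
                  content description information); these are skipped
    """
    remaining = [w for w in words if w not in stop_words]      # Step 1
    counts = Counter(remaining)                                # Step 2: N(n)
    total = len(remaining)                                     # All_N
    # Step 3: rank by descending count; ties broken alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    selected = [w for w in ranked if w not in exclude][:k]
    return [(w, counts[w] / total) for w in selected]          # TF(n)


fragment = (["student"] * 4 + ["school"] * 3 + ["course"] * 2
            + ["teacher"] + ["the"] * 5)
keywords = top_keywords(fragment, k=2, stop_words={"the"}, exclude={"student"})
```

As in the text's example, "student" is the most frequent word but is already a content description keyword, so the sketch selects "school" and "course" instead.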

Optionally, in some embodiments, the video segmentation apparatus may determine a first keyword vector according to the keywords of the first voice information fragment, the keywords of the content description information, and the content of the first voice information fragment. Specifically, the apparatus may determine how frequently the keywords of the first voice information fragment and the keywords of the content description information occur in the content of the first voice information fragment; these frequencies form the first keyword vector. The content of a voice information fragment is all the words included in the fragment. For example, suppose the keyword of the content description information is "student" and the keywords of the first voice information fragment are "course" and "school". If "student", "course", and "school" occur in the first voice information fragment with frequencies 0.1, 0.2, and 0.3 respectively, the first keyword vector, with components ordered as "school", "course", "student", is (0.3, 0.2, 0.1).

Similarly, the video segmentation apparatus may determine a second keyword vector according to the keywords of the second voice information fragment, the keywords of the content description information, and the content of the second voice information fragment. Specifically, the apparatus may determine how frequently the keywords of the second voice information fragment and the keywords of the content description information occur in the content of the second voice information fragment; these frequencies form the second keyword vector. For example, suppose the keyword of the content description information is "student" and the keywords of the second voice information fragment are "breakfast" and "nutrition". If these three keywords occur in the second voice information fragment with frequencies 0.3, 0.25, and 0.05 respectively, the second keyword vector is (0.3, 0.25, 0.05).

The video segmentation apparatus determines a distance from the first keyword vector and the second keyword vector. If the distance is greater than a preset distance, the similarity between the first voice information fragment and the second voice information fragment may be considered to be less than the similarity threshold. In this case, the apparatus determines the segmentation point according to the first voice information fragment.

The video segmentation apparatus may determine the distance from the first keyword vector and the second keyword vector as follows:

Step 1: Expand the first keyword vector into a first vector and the second keyword vector into a second vector. The keywords corresponding to the first vector and to the second vector comprise the keywords of the first voice information fragment, the keywords of the second voice information fragment, and the keywords of the content description information, and neither vector's keyword list contains a repeated keyword.

For example, suppose the first keyword vector is (0.3, 0.2, 0.1) with keywords "school", "course", and "student", and the second keyword vector is (0.3, 0.25, 0.05) with keywords "student", "breakfast", and "nutrition". In this case, with the keywords ordered as "school", "student", "breakfast", "course", "nutrition", the first vector is (0.3, 0.1, 0, 0.2, 0) and the second vector is (0, 0.3, 0.25, 0, 0.05).
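The expansion in Step 1 can be sketched with the example values above. The helper name `expand` is hypothetical, and the shared keyword order is passed in explicitly so that the result matches the order used in the text.

```python
def expand(keywords, frequencies, vocabulary):
    """Re-express a keyword vector over a shared keyword list.

    Keywords missing from the original vector get a frequency of 0.
    """
    lookup = dict(zip(keywords, frequencies))
    return [lookup.get(word, 0.0) for word in vocabulary]


vocabulary = ["school", "student", "breakfast", "course", "nutrition"]

first_vector = expand(["school", "course", "student"],
                      [0.3, 0.2, 0.1], vocabulary)
second_vector = expand(["student", "breakfast", "nutrition"],
                       [0.3, 0.25, 0.05], vocabulary)
```

Both expanded vectors now have five components in the same keyword order, so a distance between them is well defined.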

Step 2: Calculate the distance between the first vector and the second vector. This distance is the distance determined from the first keyword vector and the second keyword vector.

Optionally, in some embodiments, the distance between the first vector and the second vector may be the Euclidean distance. Two consecutive voice information fragments may share very few keywords, so if the cosine distance were used, many zero values could appear in the calculation. The Euclidean distance may therefore be the more suitable choice.

Optionally, in other embodiments, the distance between the first vector and the second vector may be the cosine distance.
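Both distance options can be sketched for the expanded example vectors (0.3, 0.1, 0, 0.2, 0) and (0, 0.3, 0.25, 0, 0.05). The function names are hypothetical, cosine distance is taken here as one minus cosine similarity, and returning 1.0 for an all-zero vector is an assumption of this sketch.

```python
import math


def euclidean_distance(u, v):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


first_vector = [0.3, 0.1, 0.0, 0.2, 0.0]
second_vector = [0.0, 0.3, 0.25, 0.0, 0.05]

d_euclid = euclidean_distance(first_vector, second_vector)
d_cosine = cosine_distance(first_vector, second_vector)
```

Only the shared keyword "student" contributes a non-zero term to the dot product, which illustrates why the cosine measure can be dominated by zeros when consecutive fragments share few keywords.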

Besides using the term frequency vectors of two adjacent voice information fragments, other methods may also be used to determine whether the two fragments are similar.

For example, the first keyword vector and the second keyword vector may instead consist of term frequency-inverse document frequency values, binary term frequencies, and so on. The distance between the first keyword vector and the second keyword vector may be an n-norm distance (n being a positive integer greater than or equal to 1) or a relative entropy distance.

Taking the first vector (0.3, 0.1, 0, 0.2, 0) and the second vector (0, 0.3, 0.25, 0, 0.05) above as an example again, both vectors may be binarized. The binarized first vector is (1, 1, 0, 1, 0) and the binarized second vector is (0, 1, 1, 0, 1). The 1-norm distance is then calculated to obtain the degree of keyword repetition between the first voice information fragment and the second voice information fragment. The degree of keyword repetition may be regarded as a special form of distance, and it may be used to determine whether the two fragments are similar. If the degree of keyword repetition is greater than or equal to a preset degree of repetition, the two fragments may be considered similar; if it is less than the preset degree of repetition, they may be considered not similar. In this case the preset degree of repetition serves as the similarity threshold.

Optionally, in other embodiments, keywords may also be extracted according to term frequency-inverse document frequency. The term frequency may be determined from Equation 1.1. The inverse document frequency may be determined according to the following formula:

IDF(n) = log(Num_Doc / (Doc(n) + 1)), Equation 1.2

where IDF(n) is the inverse document frequency of the n-th word, Num_Doc is the total number of documents in the corpus, and Doc(n) is the number of documents in the corpus that contain the n-th word.

The term frequency-inverse document frequency may be determined according to the following formula:

TF-IDF(n) = TF(n) × IDF(n), Equation 1.3

where TF-IDF(n) is the term frequency-inverse document frequency of the n-th word. If the keywords are determined according to term frequency-inverse document frequency, the first keyword vector consists of the term frequency-inverse document frequency values of the keywords.
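Equations 1.1 to 1.3 can be combined in a short sketch. The function names, the representation of the corpus as a list of word sets, and the use of the natural logarithm are assumptions, since the text does not specify the base of the logarithm.

```python
import math
from collections import Counter


def term_frequency(words):
    """Equation 1.1: TF(n) = N(n) / All_N for every distinct word."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def inverse_document_frequency(word, corpus):
    """Equation 1.2: IDF(n) = log(Num_Doc / (Doc(n) + 1))."""
    containing = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / (containing + 1))


def tf_idf(words, corpus):
    """Equation 1.3: TF-IDF(n) = TF(n) * IDF(n)."""
    return {
        word: tf * inverse_document_frequency(word, corpus)
        for word, tf in term_frequency(words).items()
    }


# Toy corpus: each document is represented by the set of words it contains.
corpus = [{"a", "b"}, {"a"}, {"c"}, {"d"}]
scores = tf_idf(["a", "a", "b", "b"], corpus)
```

In this toy corpus "a" and "b" have the same term frequency, but "b" appears in fewer documents and therefore receives the higher score.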

When keywords are determined according to term frequency-inverse document frequency, meaningless words need not be removed first.

Optionally, in other embodiments, keywords may also be extracted using the word-graph-based TextRank method. If the keywords are determined by word-graph-based TextRank, the first keyword vector may consist of the weights of the words.

When the first voice information fragment and the second voice information fragment are not similar, the video segmentation apparatus may determine the segmentation point according to the first voice information fragment.

The video segmentation apparatus may first determine whether the first voice information fragment contains a pause point. If the fragment contains exactly one pause point, that pause point may be determined to be the segmentation point. If the fragment contains multiple pause points, the apparatus may determine, for each of them, whether the word after the pause point is a preset word. The preset words include conjunctions that signal a new section, such as "next", "below", and "the next point". The word after a pause point is the word immediately following that pause point. If exactly one of the pause points is followed by a preset word, that pause point may be determined to be the segmentation point. If at least two of the pause points are followed by preset words, the one with the longest pause duration among those at least two may be determined to be the segmentation point. If none of the pause points is followed by a preset word, the pause point with the longest pause duration may be determined to be the segmentation point.

If the first voice information fragment contains no pause point, the segmentation point may be determined from the pause points adjacent to the fragment. There may be two such pause points, one before the fragment and one after it, and the apparatus may determine the segmentation point from the distance between each of them and the fragment. For a pause point before the fragment, the distance may be the number of words or the time difference between the pause point and the start of the fragment. For a pause point after the fragment, the distance may be the number of words or the time difference between the pause point and the end of the fragment. For ease of description, the adjacent pause point before the first voice information fragment is called the front pause point and its distance to the fragment is called distance 1; the adjacent pause point after the fragment is called the rear pause point and its distance to the fragment is called distance 2.

If distance 1 is less than distance 2, the front pause point may be determined to be the segmentation point; if distance 1 is greater than distance 2, the rear pause point may be determined to be the segmentation point. If distance 1 equals distance 2, the apparatus may determine the word after the front pause point (word 1) and the word after the rear pause point (word 2). If word 1 is a preset word and word 2 is not, the front pause point is determined to be the segmentation point; if word 1 is not a preset word and word 2 is, the rear pause point is determined to be the segmentation point; if both or neither are preset words, whichever of the two pause points has the longer pause duration may be determined to be the segmentation point.
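The selection rules above can be sketched as one function. This is an illustrative reading of the text only: the `Pause` record, the English cue words in `CUE_WORDS`, measuring distance as a time difference, and the behavior when a front or rear pause point does not exist are assumptions.

```python
from dataclasses import dataclass

CUE_WORDS = {"next", "below", "next point"}  # hypothetical preset words


@dataclass
class Pause:
    time: float      # moment of the pause point
    duration: float  # pause duration
    next_word: str   # word immediately following the pause point


def longest(pauses):
    return max(pauses, key=lambda p: p.duration)


def pick_pause(frag_start, frag_end, pauses, cue_words=CUE_WORDS):
    """Choose the pause point that becomes the segmentation point."""
    inside = [p for p in pauses if frag_start <= p.time <= frag_end]
    if len(inside) == 1:
        return inside[0]
    if len(inside) > 1:
        cued = [p for p in inside if p.next_word in cue_words]
        if len(cued) == 1:
            return cued[0]
        # Two or more cued pauses: longest cued one; none cued: longest of all.
        return longest(cued or inside)

    # No pause point inside the fragment: look at the adjacent ones.
    before = [p for p in pauses if p.time < frag_start]
    after = [p for p in pauses if p.time > frag_end]
    front = max(before, key=lambda p: p.time) if before else None
    rear = min(after, key=lambda p: p.time) if after else None
    if front is None or rear is None:
        return front or rear  # assumption: use whichever one exists

    distance_1 = frag_start - front.time
    distance_2 = rear.time - frag_end
    if distance_1 != distance_2:
        return front if distance_1 < distance_2 else rear
    front_cued = front.next_word in cue_words
    rear_cued = rear.next_word in cue_words
    if front_cued != rear_cued:
        return front if front_cued else rear
    return longest([front, rear])


chosen = pick_pause(4, 6, [Pause(3, 0.5, "next"), Pause(7, 0.9, "so")])
```

In the usage line both adjacent pause points are equally far from the fragment, and only the front one is followed by a cue word, so the front pause point is chosen.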

As described above, a pause point is a natural pause by the speaker and therefore has a certain duration. Optionally, in some embodiments, when a pause point is determined to be the segmentation point, the middle moment of the pause may be used as the segmentation point. Optionally, in other embodiments, the end moment of the pause may be used as the segmentation point. Optionally, in still other embodiments, the start moment of the pause may be used as the segmentation point.

303. The video segmentation apparatus segments the video to be processed according to the segmentation point.

If the segmentation point is the first segmentation point of the video to be processed, the start time of the segment is the start time of the video to be processed and the end time of the segment is the segmentation point. If the segmentation point is the k-th segmentation point of the video to be processed (k being a positive integer greater than or equal to 2), the start time of the segment is the (k-1)-th segmentation point and the end time of the segment is the k-th segmentation point.
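The mapping from segmentation points to segments can be sketched as follows. The function name `segments_from_points` and the use of plain timestamps are assumptions for illustration.

```python
def segments_from_points(video_start, segment_points):
    """Return (start, end) pairs, one per segmentation point.

    The first segment starts at the start of the video; the k-th segment
    starts at the (k-1)-th segmentation point and ends at the k-th.
    """
    bounds = [video_start] + sorted(segment_points)
    return list(zip(bounds, bounds[1:]))


segments = segments_from_points(0.0, [30.0, 75.0])
```

Content after the last segmentation point is not yet assigned to a segment in this sketch, which matches a live stream in which the next segmentation point has not been determined.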

After the segment is determined, the video segmentation device may further determine a summary of the segment.

304. The video segmentation device may determine the summary of the segment according to the content of the segment voice information, the keywords of the segment voice information, and the keywords of a target text. The target text includes at least one of the presentation and the content description information.

Optionally, in some embodiments, the video segmentation device may first determine a third keyword vector and then determine the summary of the segment according to the third keyword vector.

The video segmentation device may determine the third keyword vector according to the content of the segment voice information, the keywords of the segment voice information, and the keywords of the target text, where the content of the segment voice information refers to all the sentences that make up the voice information of the segment.

It can be understood that if the text information includes only the presentation, the target text includes the presentation; if the text information includes only the content description information, the target text includes the content description information; and if the text information includes both the presentation and the content description information, the target text includes both.

The manner in which the video segmentation device determines the keywords of the segment voice information, and the manner in which it determines the keywords of the target text, are similar to the manner in which it determines the keywords of the first voice information segment.

Optionally, in some embodiments, the video segmentation device compares the pixel values at the same positions of the presentation at different moments (or in different frames). If the number P of positions whose pixel-value change exceeds a preset change value is greater than a second preset threshold P₂ (P₂ is less than P₁), the video segmentation device may determine the keywords of the target text according to the later presentation. For example, the video segmentation device determines that, between the presentation at time T₁ and the presentation at time T₂ (T₂ is later than T₁), the number of positions whose pixel-value change exceeds the preset change value is greater than P₂ and less than P₁. In this case, the video segmentation device may determine the keywords of the target text according to the presentation at time T₂.
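The pixel comparison can be illustrated with a small sketch. It assumes that each frame is a grayscale image stored as a list of rows of integers, and that the earlier frame is kept when P does not exceed P₂; neither detail is stated in the text. The role of P₁ is defined outside this passage and is not modelled here.

```python
def count_changed_pixels(frame_a, frame_b, change_threshold):
    """Count positions whose pixel value changed by more than the threshold."""
    return sum(
        1
        for row_a, row_b in zip(frame_a, frame_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
        if abs(pixel_a - pixel_b) > change_threshold
    )


def frame_for_keywords(frame_t1, frame_t2, change_threshold, p2):
    """Return the frame whose text should supply the target-text keywords.

    If more than p2 positions changed between T1 and T2, the later frame
    is used; otherwise the earlier frame is kept (an assumption).
    """
    p = count_changed_pixels(frame_t1, frame_t2, change_threshold)
    return frame_t2 if p > p2 else frame_t1
```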

For example, assume that the number of keywords determined from the presentation is L, the number of keywords determined from the content description information is M, and the number of keywords determined from the segment voice information is Q, and that no keyword is repeated among the L keywords, the M keywords, and the Q keywords.

Specifically, the video segmentation device may first determine M keywords from the content description information and then determine the L most frequent words in the presentation. If one or more of the L words also belong to the M keywords, those words are deleted from the L words and the next most frequent words in the presentation are selected instead, until the L keywords have no intersection with the M keywords. After that, the video segmentation device determines Q words from the segment voice information. If one or more of the Q words belong to the M keywords or the L keywords, those words are deleted from the Q words and the next most frequent words in the segment voice information are selected instead, until the Q keywords have no intersection with either the L keywords or the M keywords.

The third keyword vector consists of the frequencies with which the Q keywords, the L keywords, and the M keywords occur in the segment voice information. It can be understood that if the target text does not include the content description information, the value of M is 0; if the target text does not include the presentation, the value of L is 0.
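The keyword selection and the third keyword vector described above might be sketched as follows. The sketch assumes that the texts are already tokenized into word lists, that the M description keywords are supplied by the caller, and that "frequency" means a raw occurrence count; the function names are hypothetical.

```python
from collections import Counter


def top_words_excluding(words, count, excluded):
    """Return the `count` most frequent words that are not in `excluded`."""
    picked = []
    for word, _ in Counter(words).most_common():
        if word not in excluded:
            picked.append(word)
            if len(picked) == count:
                break
    return picked


def third_keyword_vector(description_keywords, slide_words, speech_words, l, q):
    """Build disjoint M, L and Q keyword lists and count them in the speech."""
    m_keywords = list(description_keywords)
    # L keywords: most frequent slide words that are not description keywords.
    l_keywords = top_words_excluding(slide_words, l, set(m_keywords))
    # Q keywords: most frequent speech words not already selected.
    q_keywords = top_words_excluding(
        speech_words, q, set(m_keywords) | set(l_keywords)
    )
    keywords = q_keywords + l_keywords + m_keywords
    speech_counts = Counter(speech_words)
    # Every keyword is counted in the segment voice information.
    vector = [speech_counts[word] for word in keywords]
    return keywords, vector
```

Passing an empty list for the description keywords or for the slide words corresponds to the cases M = 0 and L = 0.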

The video segmentation device may determine the summary of the segment according to the determined third keyword vector.

Specifically, the video segmentation device may determine a reference text according to the target text and the content of the segment voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determine J keyword vectors according to the keywords of the segment voice information, the keywords of the target text, and each of the J sentences; and determine the summary of the segment according to the third keyword vector and the J keyword vectors. The j-th keyword vector among the J keyword vectors consists of the frequencies with which the keywords of the segment voice information and the keywords of the target text occur in the j-th sentence.

If the target text includes redundant sentences, the redundant sentences are deleted from the target text to obtain a revised target text, and the revised target text is merged with the content of the segment voice information to obtain the reference text. If the target text includes no redundant sentence, the target text is merged with the content of the segment voice information to obtain the reference text. In other words, when the target text includes both the presentation and the content description information, one or more sentences in the presentation may also appear in the content description information. In this case, the sentences in the presentation that are identical to sentences in the content description information are deleted, and the presentation with the redundant sentences removed, the content description information, and the content of the segment voice information are then merged to obtain the reference text. If the target text includes no redundant sentence, for example because no sentence in the presentation appears in the content description information, or because the target text includes only one of the presentation and the content description information, the target text can be merged directly with the content of the segment voice information to obtain the reference text.
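A minimal sketch of how the reference text could be assembled is given below. It assumes that each source is already split into sentences and that "redundant" means an exact string match; a real system might compare normalized text instead.

```python
def build_reference_text(presentation_sentences, description_sentences,
                         speech_sentences):
    """Merge the target text with the segment speech, dropping duplicates.

    Sentences of the presentation that also appear in the content
    description are removed before merging. Pass an empty list for a
    source that is absent.
    """
    description_set = set(description_sentences)
    kept = [s for s in presentation_sentences if s not in description_set]
    return kept + list(description_sentences) + list(speech_sentences)
```

The length of the returned list corresponds to J, the number of sentences in the reference text.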

That the video segmentation device determines the summary of the segment according to the third keyword vector and the J keyword vectors includes: the video segmentation device determines J distances according to the third keyword vector and the J keyword vectors, where the j-th distance among the J distances is determined according to the third keyword vector and the j-th keyword vector among the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determines the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determines the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances. The manner in which the video segmentation device determines the j-th distance according to the third keyword vector and the j-th keyword vector is similar to the manner in which it determines the distance according to the first keyword vector and the second keyword vector. The difference is that the j-th distance determined according to the third keyword vector and the j-th keyword vector is a Euclidean distance, whereas the distance determined according to the first keyword vector and the second keyword vector may be either a Euclidean distance or a cosine distance. The j-th distance cannot be a cosine distance because computing a cosine distance normalizes the j-th keyword vector. The norm of the j-th keyword vector, however, reflects the overall frequency of the keywords contained in sentence j, so the vector should not be normalized.
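The sentence selection above can be sketched as follows, assuming tokenized sentences and raw counts as frequencies. Returning the selected sentences in their original order is a choice made for this illustration; the text does not specify an order.

```python
import math


def sentence_vector(keywords, sentence_words):
    """Count how often each keyword occurs in one tokenized sentence."""
    return [sentence_words.count(word) for word in keywords]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarize(keywords, third_vector, sentences, r):
    """Return the r sentences whose keyword vectors are closest to third_vector."""
    distances = [
        (euclidean(third_vector, sentence_vector(keywords, words)), j)
        for j, words in enumerate(sentences)
    ]
    closest = sorted(distances)[:r]
    # Assumption: keep the chosen sentences in their original order.
    return [sentences[j] for j in sorted(j for _, j in closest)]
```

Because the vectors are not normalized, a sentence that mentions few keywords stays far from a segment vector with large counts, which is the property the paragraph above relies on.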

The above vectors (for example, the first keyword vector, the second keyword vector, the third keyword vector, and the j-th keyword vector) all consist of the frequencies with which keywords occur in a particular text (that is, word frequencies). In other embodiments, the above vectors may instead be determined from word vectors obtained with word to vector (word2vec). For example, the first keyword vector may be determined by the following steps: determining the word vector of each keyword using word2vec; and averaging the word vectors of all the keywords to obtain the first keyword vector. The second keyword vector is determined in a manner similar to the first keyword vector, and details are not repeated here. For another example, the third keyword vector may be determined by the following steps: determining the word vector of each keyword using word2vec; determining the word frequency of each keyword; and taking the average of the word vectors of all the keywords, weighted by the word frequency of each keyword, to obtain the third keyword vector. For another example, the j-th keyword vector may be determined by the following steps: performing word segmentation on the j-th sentence and removing stop words; determining the word vector of each remaining word using word2vec; and averaging all the word vectors to obtain the j-th keyword vector. When the keyword vectors are determined based on word2vec, the distance between the third keyword vector and the j-th keyword vector may be a cosine distance.
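The embedding-based variant can be illustrated with the averaging and distance helpers below. The word vectors are assumed to come from a trained word2vec model; the sketch takes them as plain lists of floats and does not call any particular library.

```python
import math


def mean_vector(vectors):
    """Plain average of word vectors (first, second and j-th keyword vectors)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def weighted_mean_vector(vectors, weights):
    """Average of word vectors weighted by word frequency (third keyword vector)."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, column)) / total
        for column in zip(*vectors)
    ]


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

With toy two-dimensional vectors [1, 0] and [0, 1] and word frequencies 3 and 1, the weighted mean is [0.75, 0.25], which can then be compared with a sentence vector using `cosine_distance`.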

FIG. 4 is a schematic diagram of a conference procedure according to an embodiment of the present application.

401. Conference terminal 1 transmits audio and video stream 1 to the conference control server.

402. Conference terminal 2 transmits audio and video stream 2 to the conference control server.

403. Conference terminal 3 transmits audio and video stream 3 to the conference control server.

404. The conference control server determines the main conference site.

It is assumed that the main conference site determined by the conference control server is the site where conference terminal 1 is located.

405. The conference control server sends conference data to conference terminal 2 and conference terminal 3.

406. Conference terminal 2 and conference terminal 3 store the conference data.

Optionally, in some embodiments, the conference control server may also send the conference data to conference terminal 1, and conference terminal 1 may also store the conference data.

407. The conference control server segments audio and video stream 1 in real time (that is, determines segmentation points) and extracts a summary of each segment.

408. The conference control server sends the segmentation points and the summaries to conference terminal 2 and conference terminal 3. In this way, conference terminal 2 and conference terminal 3 can each choose a review point from which to play back the video. Of course, in some implementations, the conference control server may also send the segmentation points and the summaries to conference terminal 1.

FIG. 5 is a schematic flowchart of a video segmentation method according to an embodiment of the present application.

501. The video segmentation device determines whether the conference reservation includes text related to the conference content. In other words, the video segmentation device determines whether the to-be-processed video has content description information. If the result is yes (that is, the to-be-processed video has content description information), step 502 is performed; if the result is no (that is, the to-be-processed video has no content description information), step 503 is performed.

502. The video segmentation device extracts keywords from the text related to the conference content. In other words, the video segmentation device determines the keywords of the content description information.

After the keywords of the content description information are determined, step 503 may be performed.

503. The video segmentation device determines whether a presentation is shown on a screen in the to-be-processed video. In other words, the video segmentation device determines whether the to-be-processed video includes a presentation that is displayed on a screen. If the result is yes (that is, the to-be-processed video includes a presentation), step 504 is performed. If the result is no (that is, the to-be-processed video does not include a presentation), step 505 is performed.

504. The video segmentation device determines the position of the screen used to display the presentation. After determining the position of the screen, the video segmentation device may perform step 506.

505. The video segmentation device determines whether there is a presentation transmitted through an auxiliary stream. In other words, in some possible implementations, the conference speaker may not show the presentation on a screen but may upload it to the conference control server through an auxiliary stream. Conference terminals at other conference sites can obtain, from the auxiliary stream, the presentation used by the speaker while speaking. If the result is yes (that is, there is a presentation transmitted through an auxiliary stream), step 506 is performed. If the result is no (that is, there is no presentation transmitted through an auxiliary stream), the segmentation point of the to-be-processed video may be determined according to the voice information.

506. The video segmentation device determines whether the duration from the previous segmentation point to the current moment exceeds a first preset duration. If the duration is greater than the first preset duration (that is, the result is yes), step 507 is performed. If the duration is not greater than the first preset duration, step 508 is performed. It can be understood that if the segmentation point being determined is the first segmentation point, the previous segmentation point refers to the start time of the to-be-processed video. For ease of description, the duration from the previous segmentation point to the current moment may be referred to as the presentation duration.

507. The video segmentation device determines the segmentation point of the to-be-processed video according to the content description information and the voice information.

508. The video segmentation device determines the segmentation point of the to-be-processed video according to the presentation and the voice information. For the specific implementation, refer to the embodiment shown in FIG. 3; details are not repeated here.

After determining the segmentation point of the to-be-processed video, the video segmentation device may perform step 509 and step 510.

509. The video segmentation device determines the segment voice information and the keywords of the segment voice information, where the segment voice information is the voice information between the segmentation point and the segmentation point preceding it. It can be understood that if the segmentation point is the first segmentation point of the to-be-processed video, the segment voice information is the voice information between the start time of the to-be-processed video and the segmentation point.

510. The video segmentation device determines the summary of the segment according to the segment voice information, the keywords of the segment voice information, and the keywords of the target text. For the specific implementation of steps 509 and 510, refer to the embodiment shown in FIG. 3; details are not repeated here.
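The branching in steps 501 to 508 can be summarized in a small decision function. This is only a reading of the flow. The function name, the string labels, and the fallback to voice information alone when step 507 is reached without content description information are assumptions.

```python
def choose_segmentation_basis(has_description, has_screen_presentation,
                              has_aux_presentation, elapsed,
                              first_preset_duration):
    """Return the information sources used to find the next segmentation point.

    elapsed is the time since the previous segmentation point (or since the
    start of the video when no segmentation point exists yet).
    """
    # Without a presentation only the speech, plus the description if any,
    # is available.
    speech_basis = ("description", "voice") if has_description else ("voice",)
    # Steps 503 and 505: no presentation on screen or in the auxiliary stream.
    if not (has_screen_presentation or has_aux_presentation):
        return speech_basis
    # Steps 506 and 507: the current slide has been shown for too long.
    if elapsed > first_preset_duration:
        return speech_basis
    # Step 508: segment according to the presentation and the speech.
    return ("presentation", "voice")
```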

It can be understood that, in other possible implementations, when segmenting the video and extracting summaries, the video segmentation device may first determine whether the to-be-processed video contains a presentation shown on a screen, then determine whether the conference reservation includes text related to the conference content, and finally determine whether a presentation is transmitted through an auxiliary stream. In still other possible implementations, the video segmentation device may first determine whether a presentation is transmitted through an auxiliary stream, then determine whether the conference reservation includes text related to the conference content, and finally determine whether the to-be-processed video contains a presentation shown on a screen.

The following describes, with reference to FIG. 6, how the video segmentation device determines the segmentation point of the to-be-processed video according to the content description information and the voice information. For how the video segmentation device determines the segmentation point according to the voice information alone, refer also to FIG. 6.

FIG. 6 is a schematic flowchart of a video segmentation method according to an embodiment of the present application.

601. The video segmentation device continuously intercepts voice information segments from the voice information with a window length W and a step size S.

602. The video segmentation device extracts the keywords of each voice information segment. Specifically, the video segmentation device extracts N keywords from each voice information segment.

If the video segmentation device has extracted keywords of the content description information, step 603 may be performed after step 602; otherwise, step 604 may be performed after step 602. Having extracted keywords of the content description information means that the video segmentation device has determined that the to-be-processed video has content description information; in this case, the segmentation point is determined according to the content description information and the voice information. Not having extracted such keywords means that the video segmentation device has determined that the to-be-processed video has no content description information; in this case, the segmentation point is determined according to the voice information.

603. The video segmentation device determines the word frequency vector C_i, which consists of the frequencies with which the keywords of the i-th voice information segment and the keywords of the content description information occur in the i-th voice information segment.

For the method of determining the keywords of the i-th voice information segment, refer to the embodiment shown in FIG. 3, specifically the method of determining the keywords of the first voice information segment; details are not repeated here. For the method of determining the keywords of the content description information, also refer to the embodiment shown in FIG. 3. For the manner of determining the word frequency vector of the keywords of the i-th voice information segment and the keywords of the content description information in the i-th voice information segment, refer to the manner of determining the first keyword vector in the embodiment shown in FIG. 3; details are not repeated here.

604. The video segmentation device determines the word frequency vector C_i of the keywords of the i-th voice information segment in the i-th voice information segment. This vector is determined in a manner similar to that of the word frequency vector C_i in step 603, and details are not repeated here.

After performing step 603 or step 604, the video segmentation device may perform step 605 and step 606 in sequence.

605. The video segmentation device determines the distance between C_i and C_(i-1). C_(i-1) is the word frequency vector, in the (i-1)-th voice information segment, of the keywords of the (i-1)-th voice information segment (or of those keywords together with the keywords of the content description information). The (i-1)-th voice information segment is the voice information segment immediately preceding the i-th voice information segment.

606. If the distance between C_i and C_(i-1) is greater than a preset distance, it may be determined that the segmentation point is located near the i-th voice information segment (shortly before or after it). In that case, the video segmentation device may determine the segmentation point according to the pause points. For the specific implementation of determining the segmentation point according to the pause points, refer to the embodiment shown in FIG. 3; details are not repeated here.

If the distance between C_i and C_(i-1) is less than or equal to the preset distance, it may be considered that the segmentation point is not in the i-th voice information segment or the (i-1)-th voice information segment. In this case, the video segmentation device may go on to determine the word frequency vector of the next voice information segment and compare it with the word frequency vector of the i-th voice information segment.
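Steps 601 to 606 can be sketched as a sliding-window comparison. Several details here are assumptions made only so that the example runs: the window length W and step S are counted in words rather than in seconds, keywords are simply the N most frequent words, and the two vectors being compared are built over the union of both windows' keywords (plus any description keywords) so that they have the same dimension. The refinement by pause points is not included.

```python
import math
from collections import Counter


def sliding_windows(words, window_length, step):
    """Step 601: cut the transcript into windows of W words, moving by S words."""
    last_start = max(len(words) - window_length, 0)
    return [words[s:s + window_length] for s in range(0, last_start + 1, step)]


def top_keywords(words, n):
    """Step 602: take the n most frequent words as keywords (an assumption)."""
    return [word for word, _ in Counter(words).most_common(n)]


def frequency_vector(vocabulary, window):
    """Steps 603 and 604: count each vocabulary word inside one window."""
    counts = Counter(window)
    return [counts[word] for word in vocabulary]


def candidate_windows(words, window_length, step, n_keywords,
                      preset_distance, description_keywords=()):
    """Steps 605 and 606: return indices i whose distance to window i-1 is large."""
    windows = sliding_windows(words, window_length, step)
    flagged = []
    for i in range(1, len(windows)):
        previous, current = windows[i - 1], windows[i]
        vocabulary = sorted(
            set(top_keywords(previous, n_keywords))
            | set(top_keywords(current, n_keywords))
            | set(description_keywords)
        )
        c_prev = frequency_vector(vocabulary, previous)
        c_cur = frequency_vector(vocabulary, current)
        distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(c_prev, c_cur)))
        if distance > preset_distance:
            flagged.append(i)
    return flagged
```

Each flagged index marks a window near which a segmentation point is expected; the exact moment would then be chosen from the nearby pause points.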

FIG. 7 is a structural block diagram of a video segmentation device according to an embodiment of the present application. As shown in FIG. 7, the video segmentation device 700 includes an obtaining unit 701 and a processing unit 702.

The obtaining unit 701 is configured to obtain text information of a to-be-processed video and voice information of the to-be-processed video, where the text information includes at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video.

The processing unit 702 is configured to determine a segmentation point of the to-be-processed video according to the text information and the voice information.

The processing unit 702 is further configured to segment the to-be-processed video according to the segmentation point.

For the specific functions and beneficial effects of the obtaining unit 701 and the processing unit 702, refer to the methods shown in FIG. 3 to FIG. 6; details are not repeated here.

FIG. 8 is a structural block diagram of a video segmentation device according to an embodiment of the present application. The video segmentation device 800 shown in FIG. 8 includes a processor 801, a memory 802, and a transceiver 803.

The processor 801, the memory 802, and the transceiver 803 communicate with one another through an internal connection path to transfer control signals and/or data signals.

The methods disclosed in the foregoing embodiments of this application may be applied to the processor 801 or implemented by the processor 801. The processor 801 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor 801 or by instructions in the form of software. The processor 801 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed with reference to the embodiments of this application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 802. The processor 801 reads the instructions in the memory 802 and completes the steps of the foregoing methods in combination with its hardware.

Optionally, in some embodiments, the memory 802 may store instructions for performing the method performed by the video segmentation apparatus in the methods shown in FIG. 3 to FIG. 6. The processor 801 may execute the instructions stored in the memory 802 and, in combination with other hardware (for example, the transceiver 803), complete the steps of the video segmentation apparatus in the methods shown in FIG. 3 to FIG. 6. For the specific working process and beneficial effects, refer to the descriptions of the embodiments shown in FIG. 3 to FIG. 6.

Embodiments of this application further provide a chip that includes a transceiver unit and a processing unit. The transceiver unit may be an input/output circuit or a communication interface. The processing unit is a processor, a microprocessor, or an integrated circuit integrated on the chip. The chip can perform the method of the video segmentation apparatus in the foregoing method embodiments.

Embodiments of this application further provide a computer-readable storage medium storing instructions. When the instructions are executed, the method of the video segmentation apparatus in the foregoing method embodiments is performed.

Embodiments of this application further provide a computer program product including instructions. When the instructions are executed, the method of the video segmentation apparatus in the foregoing method embodiments is performed.

A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described with reference to the embodiments disclosed in this application can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of this application.

A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes in the foregoing method embodiments, and details are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. That is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application essentially, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

1. A method of video segmentation, comprising:

acquiring, by a video segmentation apparatus, text information of a video to be processed and voice information of the video to be processed, wherein the text information comprises at least one of a presentation in the video to be processed and content description information of the video to be processed, and the content description information is information describing speech content;

determining, by the video segmentation apparatus, a segmentation point of the video to be processed according to the text information and the voice information; and

segmenting, by the video segmentation apparatus, the video to be processed according to the segmentation point.

2. The method of claim 1, wherein in the case that the text information includes the presentation, the video segmentation apparatus determining a segmentation point of the video to be processed according to the text information and the voice information comprises:

determining a switching point of the presentation, wherein the contents of the presentation before and after the switching point are different;

determining at least one pause point according to the voice information; and

determining the segmentation point according to the switching point and the at least one pause point.

3. The method of claim 2, wherein the determining the segmentation point according to the switching point and the at least one pause point comprises:

determining the switching point as the segmentation point if it is determined that the switching point is the same as one of the at least one pause point; and

determining, if it is determined that none of the at least one pause point is the same as the switching point, the pause point closest to the switching point among the at least one pause point as the segmentation point.

4. The method of claim 2 or 3, wherein the determining a switching point of the presentation comprises: determining, as the switching point, the moment at which a switching signal instructing a switch of the content of the presentation is acquired.

5. The method according to claim 2 or 3, wherein the text information further includes the content description information, and before the video segmentation apparatus determines the segmentation point of the video to be processed according to the text information and the voice information, the method further comprises:

determining that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.

6. The method of claim 1, wherein, in the case that the text information includes the content description information, the determining, by the video segmentation apparatus, a segmentation point of the video to be processed according to the text information and the voice information comprises:

determining the segmentation point of the video to be processed according to the voice information, a keyword of the content description information and a pause point in the voice information.

7. The method of claim 6, wherein the voice information includes a first voice information segment and a second voice information segment, wherein the second voice information segment is a voice information segment preceding and adjacent to the first voice information segment,

the determining the segmentation point of the video to be processed according to the voice information, the keyword of the content description information and the pause point in the voice information comprises:

determining a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information and the pause point in the voice information, wherein the segmentation point of the video to be processed comprises the first segmentation point.

8. The method of claim 7, wherein the determining a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information and the pause point in the voice information comprises:

determining the similarity of the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment and the keywords of the content description information;

determining that the similarity of the first voice information segment and the second voice information segment is smaller than a similarity threshold;

and determining the first segmentation point according to the pause point in the voice information.

9. The method of claim 8, wherein the pause point in the voice information comprises a pause point within the first voice information segment or a pause point adjacent to the first voice information segment, and wherein the determining the first segmentation point according to the pause point in the voice information comprises:

determining the first segmentation point according to at least one of the number of pause points in the first voice information segment, the number of pause points adjacent to the first voice information segment, pause duration and words adjacent to the pause points.

10. The method according to any one of claims 6 to 9, wherein the text information further comprises the presentation, and before the video segmentation apparatus determines the segmentation point of the video to be processed according to the text information and the voice information, the method further comprises:

determining that the presentation duration of the current page of the presentation is greater than a first preset duration; or

determining that the presentation duration of the current page of the presentation is less than or equal to a second preset duration.

11. The method according to any one of claims 1 to 3 and 6 to 9, further comprising: determining, by the video segmentation apparatus, an abstract of a segment according to the content of the voice information of the segment, a keyword of the voice information of the segment, and a keyword of a target text, wherein the target text includes at least one of the presentation and the content description information.

12. The method according to any one of claims 1 to 3 and 6 to 9, wherein the video to be processed is a real-time video stream, and the voice information of the video to be processed is the voice information of the real-time video stream from a start time or a last segmentation point of the real-time video stream to a current time.
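
The selection rule recited in claims 2 and 3 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the claimed implementation: time points are represented as seconds from the start of the video, the function name `choose_segmentation_point` is hypothetical, and a tie between two equally distant pause points is resolved in favor of the one listed first, which the claims do not specify.

```python
from typing import Sequence


def choose_segmentation_point(switch_point: float,
                              pause_points: Sequence[float]) -> float:
    """Pick a segmentation point for one presentation switch.

    switch_point: time (assumed: seconds) at which the presentation
        content changes.
    pause_points: times (assumed: seconds) of pauses detected in the
        voice information; at least one is required.
    """
    if not pause_points:
        raise ValueError("at least one pause point is required")

    # Case 1: the switching point coincides with a pause point,
    # so the switching point itself is the segmentation point.
    if switch_point in pause_points:
        return switch_point

    # Case 2: no pause point coincides with the switching point,
    # so the pause point closest to the switching point is used.
    # min() keeps the first of several equally close candidates
    # (an assumption made for this sketch).
    return min(pause_points, key=lambda p: abs(p - switch_point))
```

For example, with a switch at 10.0 s and pauses at 4.0 s, 12.5 s and 15.0 s, the sketch selects 12.5 s, so the segment boundary falls in a pause rather than in the middle of a sentence.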

13. A video segmentation apparatus, comprising:

an acquisition unit, configured to acquire text information of a video to be processed and voice information of the video to be processed, wherein the text information comprises at least one of a presentation in the video to be processed and content description information of the video to be processed, and the content description information is information describing speech content; and

a processing unit, configured to determine a segmentation point of the video to be processed according to the text information and the voice information;

wherein the processing unit is further configured to segment the video to be processed according to the segmentation point.

14. The video segmentation apparatus according to claim 13, wherein, in the case that the text information includes the presentation, the processing unit is specifically configured to:

determine a switching point of the presentation, wherein the content presented by the presentation before the switching point differs from the content presented after the switching point;

determine at least one pause point according to the voice information; and

determine the segmentation point according to the switching point and the at least one pause point.

15. The video segmentation apparatus according to claim 14, wherein the processing unit is specifically configured to:

determine the switching point as the segmentation point in the case that the switching point is determined to be the same as one of the at least one pause point; and

determine, in the case that none of the at least one pause point is determined to be the same as the switching point, the pause point closest to the switching point among the at least one pause point as the segmentation point.

16. The video segmentation apparatus according to claim 14 or 15, wherein the processing unit is specifically configured to determine, as the switching point, the moment at which a switching signal instructing a switch of the content of the presentation is acquired.

17. The video segmentation apparatus according to claim 14 or 15, wherein the processing unit is further configured to, in the case that the text information further includes the content description information, determine that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration before determining the segmentation point of the video to be processed according to the text information and the voice information.

18. The video segmentation apparatus according to claim 13, wherein the processing unit is specifically configured to, in the case that the text information includes the content description information, determine the segmentation point of the video to be processed according to the voice information, a keyword of the content description information, and a pause point in the voice information.

19. The video segmentation apparatus according to claim 18, wherein the voice information includes a first voice information segment and a second voice information segment, wherein the second voice information segment is a voice information segment preceding and adjacent to the first voice information segment,

the processing unit is specifically configured to determine a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information, and the pause point in the voice information, wherein the segmentation point of the video to be processed includes the first segmentation point.

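20. The video segmentation apparatus according to claim 19, wherein the processing unit is specifically configured to determine the similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keyword of the content description information;

determine that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and

determine the first segmentation point according to the pause point in the voice information.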

21. The video segmentation apparatus according to claim 20, wherein the pause point in the voice information comprises a pause point within the first voice information segment or a pause point adjacent to the first voice information segment, and wherein the processing unit is specifically configured to determine the first segmentation point according to at least one of the number of pause points in the first voice information segment, the number of pause points adjacent to the first voice information segment, pause duration, and words adjacent to the pause points.

22. The video segmentation apparatus according to any one of claims 18 to 21, wherein the processing unit is further configured to, in the case that the text information further includes the presentation and before determining the segmentation point of the video to be processed according to the text information and the voice information, determine that the presentation duration of the current page of the presentation is greater than a first preset duration; or

determine that the presentation duration of the current page of the presentation is less than or equal to a second preset duration.

23. The video segmentation apparatus according to any one of claims 13 to 15 and 18 to 21, wherein the processing unit is further configured to determine an abstract of a segment according to the content of the voice information of the segment, a keyword of the voice information of the segment, and a keyword of a target text, wherein the target text comprises at least one of the presentation and the content description information.

24. The video segmentation apparatus as claimed in any one of claims 13 to 15 and 18 to 21, wherein the video to be processed is a real-time video stream, and the voice information of the video to be processed is the voice information of the real-time video stream from a start time or a last segmentation point of the real-time video stream to a current time.
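
When both a presentation and content description information are available, claims 5 and 10 (mirrored by claims 17 and 22) condition the choice of approach on how long the current page of the presentation has been shown. The following Python sketch illustrates that gating. The function name `select_strategy`, the string labels, the use of seconds, and the requirement that the first preset duration exceed the second are assumptions made for illustration and are not stated in the claims.

```python
def select_strategy(page_duration: float,
                    first_preset: float,
                    second_preset: float) -> str:
    """Choose which text information drives segmentation.

    page_duration: how long the current presentation page has been
        shown (assumed: seconds).
    first_preset / second_preset: the first and second preset
        durations; this sketch assumes first_preset > second_preset.
    """
    if first_preset <= second_preset:
        raise ValueError("first_preset is assumed to exceed second_preset")

    # Claim 5: second preset < duration <= first preset, so the
    # presentation switching point is combined with pause points.
    if second_preset < page_duration <= first_preset:
        return "presentation"

    # Claim 10: duration > first preset, or duration <= second preset,
    # so the keywords of the content description information are used.
    return "content_description"
```

With hypothetical presets of 300 s and 10 s, a page shown for 60 s is handled by the presentation-based approach, while a page shown for 5 s or for 400 s falls back to the approach based on content description information.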