CN104751033B

CN104751033B - User behavior authentication method and device based on audio-video files

Info

Publication number: CN104751033B
Application number: CN201510094395.0A
Authority: CN
Inventors: 顾少丰; 郑可爽
Original assignee: Shanghai PPDai Financial Information Services Co Ltd
Current assignee: Shanghai PPDai Financial Information Services Co Ltd
Priority date: 2015-03-03
Filing date: 2015-03-03
Publication date: 2017-11-24
Anticipated expiration: 2035-03-03
Also published as: CN104751033A

Abstract

This application provides a user behavior authentication method and device based on audio-video files. The method includes: judging whether the video of an audio-video file uploaded by a user satisfies an authentication condition; if the video satisfies the authentication condition, converting the audio of the audio-video file into corresponding text to be authenticated; and authenticating the user behavior according to that text. The application thereby authenticates user behavior automatically from audio-video files, which saves human resources and improves authentication efficiency.

Description

A user behavior authentication method and device based on audio and video files

Technical field

The present application relates to the field of computer technology, and in particular to a user behavior authentication method and device based on audio-video files.

Background

With the widespread development of Internet technology, users can carry out various business operations over the Internet. For example, users can communicate over the Internet and can also borrow or lend money over it. Before providing such services, a service provider often needs to authenticate the user's behavior first, for example to verify from that behavior whether the user has clearly understood the rules governing the business operation.

In the related art, a user can upload an audio-video file, and the service provider then authenticates the user's behavior based on that file. At present, however, service providers usually perform this authentication manually, which is inefficient and costly. A scheme for automatically authenticating user behavior based on audio-video files is therefore needed.

Summary of the invention

In view of this, the present application provides a user behavior authentication method and device based on audio-video files.

Specifically, the present application is implemented through the following technical solutions:

A user behavior authentication method based on audio-video files, the method comprising:

judging whether the video of an audio-video file uploaded by a user satisfies an authentication condition;

if the video of the audio-video file satisfies the authentication condition, converting the audio of the audio-video file into corresponding text to be authenticated;

authenticating the user behavior according to the text to be authenticated corresponding to the audio-video file.

Further, judging whether the video of the audio-video file uploaded by the user satisfies the authentication condition includes:

extracting multiple pictures from the audio-video file;

judging whether the similarity of the backgrounds of the multiple pictures is greater than or equal to a first threshold;

if the similarity of the backgrounds of the multiple pictures is greater than or equal to the first threshold, confirming that the video of the audio-video file satisfies the authentication condition.

Further, converting the audio of the audio-video file into corresponding text to be authenticated includes:

dividing the audio of the audio-video file into N audio segments according to a preset first segmentation rule, where N is a natural number greater than 1;

converting the audio in each audio segment into corresponding initial text;

combining the N initial texts corresponding to the N audio segments to obtain the text to be authenticated corresponding to the audio-video file.

Further, after the audio in the audio segments is recognized as corresponding initial text, the method further includes:

extracting N-1 verification audio segments from the audio of the audio-video file according to a preset second segmentation rule;

converting the audio in each verification audio segment into corresponding verification text;

judging, according to the N-1 verification texts, whether the N initial texts corresponding to the N audio segments are accurate;

if the N initial texts corresponding to the N audio segments are accurate, combining the N initial texts corresponding to the N audio segments.

Further, authenticating the user behavior according to the text to be authenticated corresponding to the audio-video file includes:

calculating the text similarity between the text to be authenticated corresponding to the audio-video file and a preset authentication text;

when the text similarity is greater than or equal to a preset second threshold, confirming that the user has passed authentication.

A user behavior authentication device based on audio-video files, the device comprising:

a judging unit, configured to judge whether the video of an audio-video file uploaded by a user satisfies an authentication condition;

a conversion unit, configured to convert the audio of the audio-video file into corresponding text to be authenticated when the video of the audio-video file satisfies the authentication condition;

an authentication unit, configured to authenticate the user behavior according to the text to be authenticated corresponding to the audio-video file.

Further, the judging unit is specifically configured to extract multiple pictures from the audio-video file, judge whether the similarity of the backgrounds of the multiple pictures is greater than or equal to a first threshold, and, when that similarity is greater than or equal to the first threshold, confirm that the video of the audio-video file satisfies the authentication condition.

Further, the conversion unit is specifically configured to divide the audio of the audio-video file into N audio segments according to a preset first segmentation rule, where N is a natural number greater than 1; convert the audio in each audio segment into corresponding initial text; and combine the N initial texts corresponding to the N audio segments to obtain the text to be authenticated corresponding to the audio-video file.

Further, the device also includes:

a verification unit, configured to, after the audio in the audio segments is recognized as corresponding initial text, extract N-1 verification audio segments from the audio of the audio-video file according to a preset second segmentation rule; convert the audio in each verification audio segment into corresponding verification text; and judge, according to the N-1 verification texts, whether the N initial texts corresponding to the N audio segments are accurate;

the conversion unit specifically combines the N initial texts corresponding to the N audio segments when those N initial texts are accurate.

Further, the authentication unit specifically calculates the text similarity between the text to be authenticated corresponding to the audio-video file and a preset authentication text, and confirms that the user has passed authentication when the text similarity is greater than or equal to a preset second threshold.

As can be seen from the above description, after confirming that the video of the audio-video file uploaded by the user satisfies the authentication condition, the present application converts the audio of the audio-video file into corresponding text to be authenticated and authenticates the user behavior according to that text. User behavior can thus be authenticated automatically based on audio-video files, which saves human resources and improves authentication efficiency.

Description of drawings

Fig. 1 is a flowchart of a user behavior authentication method based on audio-video files in an exemplary embodiment of the present application.

Fig. 2 is a flowchart of another user behavior authentication method based on audio-video files in an exemplary embodiment of the present application.

Fig. 3 is a schematic structural diagram of a terminal in an exemplary embodiment of the present application.

Fig. 4 is a schematic structural diagram of a user behavior authentication device based on audio-video files in an exemplary embodiment of the present application.

Detailed description

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.

The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms "a", "said", and "the" used in the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, and so on may be used in the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".

To address the above problem, the present application provides a scheme for automatically authenticating user behavior based on audio-video files.

Referring to Fig. 1, the present application provides a user behavior authentication method based on audio-video files. The method may be applied on a terminal and includes the following steps:

Step 101: judge whether the video of the audio-video file uploaded by the user satisfies the authentication condition.

In this embodiment, multiple pictures may be randomly extracted from the audio-video file uploaded by the user, and it is then judged whether the similarity of the backgrounds of the multiple pictures is greater than or equal to a preset first threshold. When the similarity is greater than or equal to the first threshold, it can be confirmed that the video of the audio-video file satisfies the authentication condition.

Step 102: if the video of the audio-video file satisfies the authentication condition, convert the audio of the audio-video file into corresponding text to be authenticated.

In this embodiment, when the video of the audio-video file satisfies the authentication condition, the audio of the audio-video file may be converted into corresponding text to be authenticated. Specifically, the audio may be converted into the text to be authenticated by a speech recognition method. To improve the accuracy of the conversion, the audio may also be divided into multiple audio segments, the audio of each segment converted into corresponding initial text, and the initial texts combined to obtain the text to be authenticated corresponding to the audio. Those skilled in the art may of course use other methods to convert the audio of the audio-video file into the text to be authenticated, and the present application places no particular limitation on this.

Step 103: authenticate the user behavior according to the text to be authenticated corresponding to the audio-video file.

In this embodiment, the text to be authenticated corresponding to the audio-video file may be matched against a preset authentication text. For example, the text similarity between the text to be authenticated and the authentication text is calculated, and when the text similarity is greater than or equal to a second threshold, it is confirmed that the user has passed authentication.
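
Steps 101 to 103 can be sketched as a small pipeline. This is a minimal illustration rather than the patent's implementation: the function name `authenticate` and the injected callables `video_ok`, `transcribe`, and `similarity` are hypothetical stand-ins for the video check, the speech recognition, and the text matching described above.

```python
from typing import Any, Callable


def authenticate(
    media: Any,
    video_ok: Callable[[Any], bool],
    transcribe: Callable[[Any], str],
    similarity: Callable[[str, str], float],
    reference_text: str,
    second_threshold: float,
) -> bool:
    """Run steps 101-103 in order; return True when authentication passes."""
    # Step 101: stop early if the video fails the authentication condition.
    if not video_ok(media):
        return False
    # Step 102: convert the audio into the text to be authenticated.
    text = transcribe(media)
    # Step 103: compare that text with the preset authentication text.
    return similarity(text, reference_text) >= second_threshold
```

Passing the three stages in as callables keeps the sketch independent of any particular video, speech recognition, or text similarity library.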

Referring to Fig. 2, the present application provides another user behavior authentication method based on audio-video files. The method may be applied on a terminal and includes the following steps:

Step 201: extract multiple pictures from the audio-video file uploaded by the user.

In this embodiment, when the service provider wants to verify whether the user has clearly understood the rules of a business operation, the user may be asked to upload an audio-video file in which the user reads those rules aloud, which provides the basis for authentication. Generally speaking, if the uploaded file really is a recording of the user reading the business operation rules aloud, its shooting background does not change, or changes very little. The background of the audio-video file can therefore be checked first.

Specifically, in this step, multiple pictures are first extracted from the audio-video file. For example, multiple pictures may be extracted from the video frames of the audio-video file by a random algorithm. The number of pictures extracted may be set by the developer, and the present application places no particular limitation on this.
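
One way to pick the frames at random is to sample distinct frame indices and then decode only those frames. The sketch below covers the index selection only; the function name `sample_frame_indices` and the optional `seed` parameter are illustrative assumptions, and decoding the chosen frames is left to whatever video library is in use.

```python
import random
from typing import List, Optional


def sample_frame_indices(
    total_frames: int, count: int, seed: Optional[int] = None
) -> List[int]:
    """Pick `count` distinct frame indices uniformly at random, in ascending order."""
    if count > total_frames:
        raise ValueError("cannot sample more frames than the video contains")
    rng = random.Random(seed)  # a fixed seed makes the choice reproducible
    return sorted(rng.sample(range(total_frames), count))
```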

Step 202: judge whether the similarity of the backgrounds of the multiple pictures is greater than or equal to the first threshold; if it is, perform step 203.

Following step 201, after the multiple pictures are extracted, face recognition may first be performed on each picture, and the face is then removed from the picture to obtain its background. Once the backgrounds of the multiple pictures have been obtained, their similarity is calculated, for example by algorithms such as SIFT (Scale-Invariant Feature Transform) or MD5 (Message-Digest Algorithm 5). The present application places no particular limitation on this.

Specifically, in this embodiment, the similarity of the backgrounds of every pair of pictures may be calculated, for example the similarity S_ab of picture a and picture b. The average of the resulting similarities S_ab is then calculated and taken as the similarity of the backgrounds of the multiple pictures.
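
The averaging over all pairs can be sketched as follows. The pairwise measure is passed in as a function. The `jaccard` helper, which compares backgrounds represented as sets of feature identifiers, is only a toy stand-in for illustration and is not one of the algorithms the text names.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def mean_pairwise_similarity(
    backgrounds: Sequence[T], sim: Callable[[T, T], float]
) -> float:
    """Average sim(a, b) over every unordered pair of backgrounds."""
    pairs = list(combinations(backgrounds, 2))
    if not pairs:
        raise ValueError("need at least two pictures")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def jaccard(a: set, b: set) -> float:
    """Toy similarity: shared features divided by all features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

For k pictures this evaluates k(k-1)/2 pairs, so the number of extracted pictures directly controls the cost of the check.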

In this step, it is judged whether the similarity of the backgrounds of the multiple pictures is greater than or equal to the first threshold. The first threshold may be set by the developer, and the present application places no particular limitation on this. If the similarity is greater than or equal to the first threshold, it can be confirmed that the shooting background of the audio-video file does not change or changes very little, so the audio-video file satisfies the authentication condition and step 203 is performed. If the similarity is smaller than the first threshold, step 201 may be performed again to extract a new set of pictures for judgment. If the preset number of re-extractions is reached and the similarity of the backgrounds of the extracted pictures is still smaller than the first threshold, it can be confirmed that the video of the audio-video file does not satisfy the authentication condition, for example because the user uploaded an arbitrary audio-video file. In this embodiment, an authentication failure message may then be returned.
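
The retry logic can be sketched as a bounded loop. Here `sample_similarity` is a hypothetical callable that extracts a fresh set of pictures and returns their background similarity, and `max_attempts` is assumed to count the first extraction plus any re-extractions.

```python
from typing import Callable


def video_meets_condition(
    sample_similarity: Callable[[], float],
    first_threshold: float,
    max_attempts: int,
) -> bool:
    """Re-sample pictures until one sample passes or the attempts run out."""
    for _ in range(max_attempts):
        # Each call stands for steps 201-202 on a newly extracted sample.
        if sample_similarity() >= first_threshold:
            return True
    # Every sample stayed below the threshold: report authentication failure.
    return False
```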

Step 203: confirm that the video of the audio-video file satisfies the authentication condition.

Based on the judgment result of step 202, when the similarity of the backgrounds of the multiple pictures is greater than or equal to the first threshold, it can be confirmed that the video of the audio-video file satisfies the authentication condition, and step 204 is performed.

Optionally, in another embodiment of the present application, after multiple pictures are extracted from the audio-video file uploaded by the user, face recognition may instead be performed on each picture and the face similarity of the multiple pictures judged. When the face similarity of the multiple pictures is greater than or equal to a preset threshold, it is confirmed that the video of the audio-video file satisfies the authentication condition, and step 204 is performed.

Step 204: divide the audio of the audio-video file into N audio segments according to a preset first segmentation rule.

In this embodiment, once it is confirmed that the video of the audio-video file satisfies the authentication condition, the audio of the audio-video file is converted into corresponding text to be authenticated in order to complete the authentication process. In a practical implementation, the audio may first be denoised, for example by analyzing its frequency spectrum to remove interference, and then divided into N audio segments according to the preset first segmentation rule.

Specifically, in this embodiment, the audio of the audio-video file may be divided into N audio segments according to the preset first segmentation rule, where N is a natural number greater than 1. The first segmentation rule may be set by the developer. For example, the audio of the audio-video file may be divided, in chronological order, into audio segments each 5 seconds long.
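
The 5-second rule can be sketched as a function that returns segment boundaries in seconds. The name `split_fixed` is illustrative, and letting the last segment be shorter than the others when the duration is not a multiple of the segment length is an assumption that the text does not address.

```python
from typing import List, Tuple


def split_fixed(
    duration: float, segment_length: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times, in seconds, of consecutive fixed-length segments."""
    if duration <= 0 or segment_length <= 0:
        raise ValueError("duration and segment_length must be positive")
    segments = []
    start = 0.0
    while start < duration:
        # The final segment is truncated at the end of the audio.
        end = min(start + segment_length, duration)
        segments.append((start, end))
        start = end
    return segments
```

Audio no longer than one segment length yields a single segment, so a caller that needs N greater than 1 would have to choose a shorter segment length for such files.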

Step 205: convert the audio in each audio segment into corresponding initial text.

Following step 204, after the N audio segments of the audio in the audio-video file are obtained, each audio segment may be converted into corresponding text by a speech recognition method from the related art. In this embodiment, the text corresponding to an audio segment is called the initial text.

Step 206: combine the N initial texts corresponding to the N audio segments to obtain the text to be authenticated corresponding to the audio-video file.

Following step 205, after the N initial texts corresponding to the N audio segments are obtained, in this step the N initial texts are combined in chronological order to obtain the text to be authenticated corresponding to the audio-video file.

Further, in this embodiment, to improve the accuracy of the text to be authenticated, a verification flow is performed on the initial texts after step 205 has recognized the audio in the audio segments as initial text. For example, N-1 verification audio segments may be extracted from the audio of the audio-video file according to a preset second segmentation rule. The audio in each verification audio segment is then converted into corresponding verification text, and the N-1 verification texts are used to judge whether the N initial texts corresponding to the N audio segments are accurate. If they are accurate, the N initial texts are combined to obtain the text to be authenticated corresponding to the audio-video file. The second segmentation rule may also be set by the developer. For example, each of the N-1 verification audio segments extracted according to the second segmentation rule overlaps both of two consecutive audio segments produced by the first segmentation rule.

Suppose that, according to the first segmentation rule, the audio of the uploaded audio-video file is divided into three audio segments: the first covers seconds 0-5 of the audio, the second covers seconds 5-10, and the third covers seconds 10-15. Two verification audio segments can then be extracted according to the second segmentation rule. The first verification audio segment may cover seconds 4-7 of the audio, so it overlaps both the first and the second audio segments. The second verification audio segment covers seconds 9-12, so it overlaps both the second and the third audio segments.
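
The windows in this example start 1 second before each segment boundary and end 2 seconds after it. The sketch below generalizes that pattern. The function name `verification_windows` and the `before` and `after` parameters are illustrative assumptions, since the text leaves the second segmentation rule to the developer.

```python
from typing import List, Tuple

Segment = Tuple[float, float]


def verification_windows(
    segments: List[Segment], before: float = 1.0, after: float = 2.0
) -> List[Segment]:
    """Return one window per boundary, overlapping both adjacent segments."""
    windows = []
    for left, right in zip(segments, segments[1:]):
        boundary = left[1]
        # Clamp so the window never extends past the two adjacent segments.
        start = max(left[0], boundary - before)
        end = min(right[1], boundary + after)
        windows.append((start, end))
    return windows
```

For the three segments 0-5, 5-10, and 10-15 this yields the windows 4-7 and 9-12, and in general N segments yield N-1 windows.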

In this embodiment, after the N-1 verification audio segments are extracted, the audio in each verification audio segment is converted into corresponding verification text. It is then judged whether each verification text is contained in the initial texts of the audio segments that its verification audio segment overlaps. If so, the N initial texts corresponding to the N audio segments can be confirmed as accurate.

Continuing the example above, suppose the initial text of the first audio segment is "申请人已阅读" ("the applicant has read"), that of the second audio segment is "上述规则规定" ("the above rules and provisions"), and that of the third audio segment is "2015年2月1日" ("February 1, 2015"), while the verification text of the first verification audio segment is "阅读上述" ("read the above") and that of the second verification audio segment is "规定，2015年" ("provisions, 2015"). The verification text "阅读上述" is contained in the initial texts "申请人已阅读" and "上述规则规定" taken together, and the verification text "规定，2015年" is contained in the initial texts "上述规则规定" and "2015年2月1日" taken together, so the initial texts of the three audio segments can be confirmed as accurate. If a verification text is not contained in the initial texts of the audio segments that its verification audio segment overlaps, the N initial texts corresponding to the N audio segments can be confirmed as inaccurate.
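
The containment check can be sketched with plain substring matching over the concatenation of each pair of adjacent initial texts. Stripping punctuation and spaces before comparing is an assumption added here so that the comma in "规定，2015年" does not prevent a match. The text itself does not say how punctuation is handled, and the function names are illustrative.

```python
import unicodedata
from typing import List


def normalize(text: str) -> str:
    """Drop punctuation and space separators before comparing texts."""
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in ("P", "Z")
    )


def initial_texts_accurate(initial_texts: List[str], check_texts: List[str]) -> bool:
    """Check that verification text i lies within initial texts i and i+1 joined."""
    if len(check_texts) != len(initial_texts) - 1:
        raise ValueError("expected N-1 verification texts for N initial texts")
    for i, check in enumerate(check_texts):
        joined = normalize(initial_texts[i] + initial_texts[i + 1])
        if normalize(check) not in joined:
            return False
    return True
```

With the example strings, `initial_texts_accurate(["申请人已阅读", "上述规则规定", "2015年2月1日"], ["阅读上述", "规定，2015年"])` evaluates to `True`.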

In this embodiment, when the N initial texts corresponding to the N audio segments are confirmed to be inaccurate, N-1 verification audio segments may be re-extracted according to a preset third segmentation rule and the verification repeated. If the result is still inaccurate, the initial texts may be corrected according to the verification texts corresponding to these N-1 verification audio segments, or the audio of the uploaded audio-video file may be re-divided into M audio segments according to a preset fourth segmentation rule, after which step 205 and the subsequent verification flow are performed again. The present application places no particular limitation on this.

Step 207: calculate the text similarity between the text to be authenticated corresponding to the audio-video file and the preset authentication text.

Following step 206, after the text to be authenticated corresponding to the audio-video file is obtained, the text similarity between it and the preset authentication text is calculated. Specifically, in this step the text similarity may be calculated by a text similarity algorithm from the related art, which is not described further here.
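
The text does not name a specific similarity algorithm. As one possible choice, the sketch below uses `difflib.SequenceMatcher` from the Python standard library, whose `ratio()` method returns a value between 0 and 1. The function name `text_similarity` is illustrative.

```python
from difflib import SequenceMatcher


def text_similarity(candidate: str, reference: str) -> float:
    """Similarity in [0, 1] based on the matching blocks of the two strings."""
    return SequenceMatcher(None, candidate, reference).ratio()
```

Identical strings score 1.0 and strings with no characters in common score 0.0, so the result can be compared directly against a threshold in that range.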

步骤208，当所述文本相似度大于等于预设的第二阈值时，确认用户认证通过。Step 208, when the text similarity is greater than or equal to the preset second threshold, confirm that the user has passed the authentication.

基于前述步骤207，在计算得到所述待认证文本和预设的认证文本的文本相似度后，判断该文本相似度是否大于等于第二阈值，所述第二阈值可以由开发人员进行设置。如果所述文本相似度大于等于预设的第二阈值时，则可以确认用户认证通过，如果所述文本相似度小于所述第二阈值，则可以确认用户认证失败。Based on the aforementioned step 207, after calculating the text similarity between the text to be authenticated and the preset authentication text, it is judged whether the text similarity is greater than or equal to a second threshold, and the second threshold can be set by the developer. If the text similarity is greater than or equal to a preset second threshold, it can be confirmed that the user authentication has passed, and if the text similarity is smaller than the second threshold, it can be confirmed that the user authentication has failed.

Further, in another optional embodiment of this application, a third threshold may also be set, the third threshold being smaller than the second threshold. When the text similarity is less than the second threshold, it may additionally be judged whether the text similarity is greater than or equal to the third threshold. If it is, a prompt may be output to the administrator as a reminder to manually authenticate the user behavior based on the audio-video file.
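The two-threshold decision of step 208 and this optional embodiment can be sketched as follows. The names `Decision` and `decide` and the default values 0.9 and 0.7 are hypothetical; the application only requires that the third threshold be smaller than the second, and leaves the values to the developer.

```python
from enum import Enum


class Decision(Enum):
    PASSED = "passed"                # similarity >= second threshold
    MANUAL_REVIEW = "manual_review"  # third threshold <= similarity < second
    FAILED = "failed"                # similarity < third threshold


def decide(similarity: float,
           second_threshold: float = 0.9,   # assumed value
           third_threshold: float = 0.7) -> Decision:  # assumed value
    """Map a text similarity score to an authentication outcome."""
    if third_threshold >= second_threshold:
        raise ValueError("third threshold must be smaller than second threshold")
    if similarity >= second_threshold:
        return Decision.PASSED
    if similarity >= third_threshold:
        # Below the pass mark but close to it: prompt the administrator.
        return Decision.MANUAL_REVIEW
    return Decision.FAILED
```

In this sketch, `MANUAL_REVIEW` corresponds to outputting a prompt to the administrator; how that prompt is delivered is outside what the text specifies.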

Corresponding to the foregoing embodiments of the user behavior authentication method based on audio-video files, the present disclosure also provides embodiments of a user behavior authentication device based on audio-video files.

Corresponding to the embodiments of the user behavior authentication method based on audio-video files in this application, this application also provides a user behavior authentication device based on audio-video files. The device described in this application may be implemented in software, in hardware, or in a combination of software and hardware. Taking a software implementation as an example, the user behavior authentication device based on audio-video files is a device in the logical sense: it is formed when the processor of the equipment on which it resides reads the corresponding computer program instructions from non-volatile storage into memory and runs them.

Referring to FIG. 3 and FIG. 4, this application provides a user behavior authentication device 300 based on audio-video files. The device 300 includes a judging unit 301, a converting unit 302, an authentication unit 303, and a verification unit 304.

The judging unit 301 is configured to judge whether the video of the audio-video file uploaded by the user satisfies the authentication condition;

the converting unit 302 is configured to convert the audio of the audio-video file into the corresponding text to be authenticated when the video of the audio-video file satisfies the authentication condition;

the authentication unit 303 is configured to authenticate the user behavior according to the text to be authenticated corresponding to the audio-video file.

Further, the judging unit 301 is specifically configured to extract a plurality of pictures from the audio-video file, judge whether the similarity of the backgrounds of the plurality of pictures is greater than or equal to a first threshold, and, when the similarity of the backgrounds is greater than or equal to the first threshold, confirm that the video of the audio-video file satisfies the authentication condition.

Further, the converting unit 302 is specifically configured to divide the audio of the audio-video file into N audio segments according to a preset first segmentation rule, N being a natural number greater than 1; convert the audio in the audio segments into the corresponding initial texts; and combine the N initial texts corresponding to the N audio segments to obtain the text to be authenticated corresponding to the audio-video file.

The verification unit 304 is configured to, after the audio in the audio segments has been recognized as the corresponding initial texts, extract N-1 verification audio segments from the audio of the audio-video file according to a preset second segmentation rule; convert the audio in the verification audio segments into the corresponding verification texts; and judge, according to the N-1 verification texts, whether the N initial texts corresponding to the N audio segments are accurate.

Further, the authentication unit 303 is specifically configured to calculate the text similarity between the text to be authenticated corresponding to the audio-video file and the preset authentication text and, when the text similarity is greater than or equal to the preset second threshold, to confirm that the user authentication has passed.

For the implementation of the functions and roles of each unit in the above device, see the implementation of the corresponding steps in the above method; the details are not repeated here.
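For a software implementation, the way the units of device 300 cooperate can be sketched as a composition of three callables. This is a minimal sketch under stated assumptions: the class name `AuthenticationDevice`, its field names, and the use of `bytes` for the uploaded file are hypothetical, the verification unit 304 is assumed to run inside the conversion callable, and treating a failed video check as a failed authentication is an assumption of the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AuthenticationDevice:
    """Logical device 300 expressed as three pluggable units (hypothetical API)."""

    judge_video: Callable[[bytes], bool]      # role of judging unit 301
    convert_audio: Callable[[bytes], str]     # role of converting unit 302
    authenticate_text: Callable[[str], bool]  # role of authentication unit 303

    def authenticate(self, media: bytes) -> bool:
        # The audio is converted only when the video meets the condition.
        if not self.judge_video(media):
            return False  # assumption: an unmet video condition means failure
        text_to_authenticate = self.convert_audio(media)
        return self.authenticate_text(text_to_authenticate)
```

Injecting the units as callables keeps the ordering constraint (video check first, then conversion, then text authentication) in one place while leaving each unit free to be implemented in software or backed by hardware.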

The above are merely preferred embodiments of this application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims

1. A user behavior authentication method based on an audio-video file, characterized in that the method comprises:

judging whether the video of an audio-video file uploaded by a user satisfies an authentication condition;

if the video of the audio-video file satisfies the authentication condition, converting the audio of the audio-video file into corresponding text to be authenticated;

authenticating the user behavior according to the text to be authenticated corresponding to the audio-video file;

wherein judging whether the video of the audio-video file uploaded by the user satisfies the authentication condition comprises:

extracting a plurality of pictures from the audio-video file;

judging whether the similarity of the backgrounds of the plurality of pictures is greater than or equal to a first threshold;

if the similarity of the backgrounds of the plurality of pictures is greater than or equal to the first threshold, confirming that the video of the audio-video file satisfies the authentication condition.

2. The method according to claim 1, characterized in that converting the audio of the audio-video file into the corresponding text to be authenticated comprises:

dividing the audio of the audio-video file into N audio segments according to a preset first segmentation rule, N being a natural number greater than 1;

converting the audio in the audio segments into corresponding initial texts;

combining the N initial texts corresponding to the N audio segments to obtain the text to be authenticated corresponding to the audio-video file.

3. The method according to claim 2, characterized in that, after the audio in the audio segments is recognized as the corresponding initial texts, the method further comprises:

extracting N-1 verification audio segments from the audio of the audio-video file according to a preset second segmentation rule;

converting the audio in the verification audio segments into corresponding verification texts;

judging, according to the N-1 verification texts, whether the N initial texts corresponding to the N audio segments are accurate;

if the N initial texts corresponding to the N audio segments are accurate, combining the N initial texts corresponding to the N audio segments.

4. The method according to claim 1, characterized in that authenticating the user behavior according to the text to be authenticated corresponding to the audio-video file comprises:

calculating the text similarity between the text to be authenticated corresponding to the audio-video file and a preset authentication text;

when the text similarity is greater than or equal to a preset second threshold, confirming that the user authentication has passed.

5. A user behavior authentication device based on an audio-video file, characterized in that the device comprises:

a judging unit, configured to judge whether the video of an audio-video file uploaded by a user satisfies an authentication condition;

a converting unit, configured to convert the audio of the audio-video file into corresponding text to be authenticated when the video of the audio-video file satisfies the authentication condition;

an authentication unit, configured to authenticate the user behavior according to the text to be authenticated corresponding to the audio-video file;

wherein the judging unit is specifically configured to extract a plurality of pictures from the audio-video file, judge whether the similarity of the backgrounds of the plurality of pictures is greater than or equal to a first threshold, and, when the similarity of the backgrounds of the plurality of pictures is greater than or equal to the first threshold, confirm that the video of the audio-video file satisfies the authentication condition.

6. The device according to claim 5, characterized in that

the converting unit is specifically configured to divide the audio of the audio-video file into N audio segments according to a preset first segmentation rule, N being a natural number greater than 1; convert the audio in the audio segments into corresponding initial texts; and combine the N initial texts corresponding to the N audio segments to obtain the text to be authenticated corresponding to the audio-video file.

7. The device according to claim 6, characterized in that the device further comprises:

a verification unit, configured to, after the audio in the audio segments is recognized as the corresponding initial texts, extract N-1 verification audio segments from the audio of the audio-video file according to a preset second segmentation rule; convert the audio in the verification audio segments into corresponding verification texts; and judge, according to the N-1 verification texts, whether the N initial texts corresponding to the N audio segments are accurate;

wherein the converting unit is specifically configured to combine the N initial texts corresponding to the N audio segments when the N initial texts corresponding to the N audio segments are accurate.

8. The device according to claim 5, characterized in that

the authentication unit is specifically configured to calculate the text similarity between the text to be authenticated corresponding to the audio-video file and a preset authentication text and, when the text similarity is greater than or equal to a preset second threshold, to confirm that the user authentication has passed.