Detailed Description
      Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
      It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
      Fig. 1 is a schematic view of an application scenario of the method for determining a sound object provided in the present application. For example, Fig. 1 shows a conference room scenario in which a plurality of people participate in a conference. Only three participants, namely participant A, participant B and participant C, are schematically shown in Fig. 1; the number of participants is not limited. To facilitate rapid extraction of the voice content of the conference afterwards, the voices of the participants recorded over the whole conference can be separated by role, that is, by speaker.
      While a participant is speaking, the audio signal can be collected by an audio collector preset in the conference room. After the audio signal emitted by the participants is obtained from the audio collector, it can be divided into frames of a preset duration, which yields the audio frames emitted by the participants.
      In the embodiment of the application, the audio frame currently acquired by the audio collector is used as the target audio frame.
      It should be noted that, in the embodiment of the present application, in order to achieve accurate determination of the sound emission object, it is also necessary to acquire position information of the sound emission object from which the target audio frame is emitted. For each target audio frame, positional information of a sound-emitting object from which the target audio frame is emitted may be determined by a sound source localization technique. For example, the position information of the sound-generating object that generates the target audio frame may be acquired by using an audio collector installed in the conference room in advance.
      The sound source localization technique is a technique of acquiring position information of a sound source. In some embodiments, the sound source position information may be relative position information between the sound-emitting object and a preset audio collector. For example, the sound source position information may include an angle between the sound-emitting object and a preset audio collector.
      In some embodiments, the preset audio collector may be a microphone array, and the sound source localization technology may be a microphone array sound source localization. The microphone array is composed of several to thousands of microphones which are arranged according to a certain rule. After the audio signals are received by the microphone array, a time delay estimation method is adopted to position the sound source. Specifically, the audio signals are received through the microphone array, the time delay of the audio signals received by each microphone relative to the audio signals received by the reference point is calculated, and the sound source is positioned according to the calculated time delay.
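      The time delay estimation described above can be sketched for a single pair of microphones as follows. This is a minimal illustration, not the localization method of the embodiment: it assumes a far-field source, a known microphone spacing and a speed of sound of 343 m/s, and it uses a brute-force cross-correlation to find the delay. The function names `estimate_delay_samples` and `delay_to_angle_deg` are hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def estimate_delay_samples(reference, other, max_lag):
    """Return the lag (in samples) at which `other` best aligns with `reference`.

    A positive result means `other` received the signal later than `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                score += ref_value * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle_deg(delay_samples, sample_rate_hz, mic_spacing_m):
    """Convert a delay between two microphones into an arrival angle.

    Far-field model (an assumption): delay = spacing * sin(angle) / speed_of_sound.
    """
    delay_s = delay_samples / sample_rate_hz
    sin_angle = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    sin_angle = max(-1.0, min(1.0, sin_angle))  # clamp against rounding and noise
    return math.degrees(math.asin(sin_angle))
```

      With more than two microphones, the same delay estimate would be computed for each microphone relative to the reference point and the results combined; how they are combined is left open here.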
      For example, participant B speaks during the time period 0 to t1 of the conference, participant A speaks during the time period t1 to t2, and participant C speaks during the time period t2 to t3, where time t3 is later than time t2 and time t2 is later than time t1. The audio collector in the conference room collects the audio frames emitted by the participants in real time.
      In the embodiment of the application, when the audio collector in the conference room acquires the 1st audio frame and the position information D1 of the sound-emitting object that emitted the 1st audio frame, the 1st audio frame is used as the starting point of the content uttered by the first sound-emitting object.
      Then, the audio collector continues by acquiring the 2nd audio frame and the position information D2 of the sound-emitting object that emitted the 2nd audio frame. When the target audio frame is the 2nd audio frame, the first target audio segment includes the 1st audio frame. The position information D2 of the 2nd audio frame is then matched against the position information D1 of the 1st audio frame. If D2 matches D1, the sound-emitting object of both the 1st and the 2nd audio frame can be determined to be the first sound-emitting object, and the 3rd audio frame and the position information D3 of the sound-emitting object that emitted it are acquired next.
      When the 3rd audio frame is the target audio frame, the first target audio segment may, for example, include the 1st audio frame and the 2nd audio frame. Since the 1st and 2nd audio frames have the same sound-emitting object, the second target audio segment may also include the 1st audio frame and the 2nd audio frame.
      Then, the position information D3 of the 3rd audio frame is matched against the first position information corresponding to the first target audio segment. For example, the first position information may be the average value D' of the position information of the sound-emitting object that emitted the 1st audio frame and that of the sound-emitting object that emitted the 2nd audio frame. If D3 matches D', the sound-emitting objects of the 1st to 3rd audio frames are all determined to be the first sound-emitting object.
      By analogy, the audio frames whose sound-emitting object is the first sound-emitting object can be determined in the same way. Assuming that the sound-emitting objects of the 1st to M1-th audio frames have all been determined to be the first sound-emitting object, the (M1+1)-th audio frame, i.e. the target audio frame, and the target position information D_(M1+1) of the sound-emitting object that emitted it are acquired next.
      When the target audio frame is the (M1+1)-th audio frame, the first target audio segment may include the N audio frames preceding the target audio frame, where N is an integer greater than or equal to 1 and M1 is a positive integer. The first target audio segment may include the audio frames from the (M1-N)-th audio frame to the M1-th audio frame. It should be noted that, if the number of audio frames acquired before the target audio frame is less than N, the first target audio segment includes all audio frames before the target audio frame.
      When the target audio frame is the (M1+1)-th audio frame, the second target audio segment may be all or a portion of the audio frames emitted by the sound-emitting object of the first target audio segment, and the second target audio segment comprises the first target audio segment. For example, the second target audio segment includes all audio frames from the 1st audio frame to the M1-th audio frame.
      As an example, M1 = 1000 and N = 500. If the 1001st audio frame is the target audio frame, the first target audio segment includes the audio frames from the 500th audio frame to the 1000th audio frame, and the second target audio segment includes the audio frames from the 1st audio frame to the 1000th audio frame.
      For example, the first position information corresponding to the first target audio segment may be the average value D'' of the position information of the sound-emitting object of each audio frame from the 500th audio frame to the 1000th audio frame. Next, the position information D1001 of the participant who emitted the 1001st audio frame is matched against the position information D''. If D1001 does not match D'', it may be determined that the sound-emitting object that emitted the 1001st audio frame is different from the first sound-emitting object that emitted the 1st to 1000th audio frames, which indicates that the speaker in the conference room has switched. The 1000th audio frame may then be used as the end point of the content uttered by the first sound-emitting object, and the 1001st audio frame as the starting point of the content uttered by the second sound-emitting object.
      Next, the audio frames from the starting point to the end point of the content uttered by the first sound-emitting object, i.e. the 1st to 1000th audio frames, which form the second target audio segment, may be extracted. The target voiceprint features of the second target audio segment are then extracted, and the target sound-emitting object corresponding to the 1st to 1000th audio frames, namely the first sound-emitting object, is determined based on the target voiceprint features. Because participant B speaks first, the sound-emitting object corresponding to the 1st to 1000th audio frames can be determined to be participant B according to the target voiceprint features of those frames.
      The sound-emitting object is identified from voiceprint features by means of voiceprint recognition technology, which recognizes the identity of a sound-emitting object through its voiceprint features. A voiceprint refers to the spectrum of sound waves carrying verbal information.
      Then, the next target audio frame is acquired, and the end point of the content uttered by the second sound-emitting object can be obtained in the same way. For example, if the sound-emitting objects of the 1001st to 2000th audio frames are all the second sound-emitting object, and the target position information of the sound-emitting object that emitted the 2001st audio frame is determined not to match the average value of the position information of the sound-emitting object of each of the 1500th to 2000th audio frames, the 2000th audio frame is the end point of the content uttered by the second sound-emitting object. By extracting the target voiceprint features of the 1001st to 2000th audio frames, the target sound-emitting object of those frames can be determined according to the voiceprint features. Since participant A is the second person to speak, it can be determined that the sound-emitting object of the 1001st to 2000th audio frames is participant A. Similarly, the audio frames emitted by participant C may also be determined.
      In the embodiment of the application, by combining the voiceprint recognition technology and the sound source positioning technology, the sound production object corresponding to the sound production content can be accurately determined.
      It should be noted that, besides determining the sound-emitting object in a conference room, the method for determining a sound object provided in the embodiment of the present specification may also be applied to other scenes, for example an audition scene, an interview scene, a classroom scene, and the like; the conference room scene is used here only as an example.
      Based on the above application scenarios, the following describes in detail a method for determining a sound object according to an embodiment of the present application with reference to fig. 2.
      Fig. 2 is a flow chart of a method for determining an utterance object according to a first aspect of the present application.
      As shown in fig. 2, a method 200 for determining an utterance object provided in an embodiment of the present application includes:
       step 210, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
       step 220, if the target position information is determined not to match the first position information corresponding to the first target audio segment, extracting target voiceprint characteristics of a second target audio segment, wherein the first target audio segment comprises the first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment; n is an integer greater than or equal to 1;
       step 230, determining a target sound-emitting object of the second target audio segment according to the target voiceprint characteristics.
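      Steps 210 to 230 can be sketched as a streaming loop. This is a minimal sketch under stated assumptions, not a definitive implementation: position information is taken to be a single angle in degrees, matching is a simple absolute-difference test, frame indices are 0-based, and `identify` is a hypothetical callback standing in for voiceprint extraction and database lookup. Closing the last open segment at the end of the stream is an added convenience that the steps above do not describe.

```python
from collections import deque


def separate_roles(frames_with_angles, window_size, max_angle_diff_deg, identify):
    """Split a stream of (frame, angle) pairs into per-speaker segments.

    `identify` is a hypothetical callback that stands in for voiceprint
    extraction and lookup; it receives the frames of one segment and
    returns a speaker label.
    """
    recent_angles = deque(maxlen=window_size)  # positions of the first target audio segment
    segment_frames = []                        # frames of the second target audio segment
    segment_start = 0
    results = []

    for index, (frame, angle) in enumerate(frames_with_angles):
        # Step 210: the current frame and its position information.
        if recent_angles:
            first_position = sum(recent_angles) / len(recent_angles)
            if abs(angle - first_position) > max_angle_diff_deg:
                # Steps 220 and 230: position mismatch, so close the segment.
                results.append((segment_start, index - 1, identify(segment_frames)))
                segment_frames = []
                segment_start = index
                recent_angles.clear()
        segment_frames.append(frame)
        recent_angles.append(angle)

    if segment_frames:  # assumption: flush the last open segment at end of stream
        results.append((segment_start, segment_start + len(segment_frames) - 1,
                        identify(segment_frames)))
    return results
```

      Each result is a tuple of the first frame index, the last frame index and the label returned by `identify` for that segment.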
      A specific implementation of step 210 will be described first.
      In an embodiment of the present application, the voice data stream comprises a series of sample point values in time order. The sample point values are obtained by sampling the original analog sound signal at a particular audio sampling rate, and a series of sample point values can describe the sound. The audio sampling rate is the number of samples taken per second, expressed in hertz (Hz). The higher the audio sampling rate, the higher the frequency of the sound waves that can be described. An audio frame comprises a fixed number of sample point values in time order.
      After the sound-emitting object emits an audio signal, the preset audio collector can collect it. After the audio signal emitted by the sound-emitting object is acquired from the audio collector, the collected target audio signal can be divided into a plurality of frames according to the preset duration, so that each audio frame emitted by the sound-emitting object is obtained. In the embodiment of the application, the audio frame currently acquired by the audio collector is determined as the target audio frame.
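      The framing step can be illustrated as follows. The sketch assumes non-overlapping frames and drops any trailing samples that do not fill a whole frame; neither choice is stated in the description, and the function name `split_into_frames` is hypothetical.

```python
def split_into_frames(samples, sample_rate_hz, frame_duration_ms):
    """Split a sequence of sample values into fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped here;
    padding them instead would be an equally valid choice.
    """
    frame_length = int(sample_rate_hz * frame_duration_ms / 1000)
    if frame_length <= 0:
        raise ValueError("frame duration is too short for this sample rate")
    usable = len(samples) - len(samples) % frame_length
    return [samples[start:start + frame_length]
            for start in range(0, usable, frame_length)]
```

      For instance, at a sampling rate of 16000 Hz a preset duration of 20 ms corresponds to 320 sample point values per audio frame.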
      In the embodiment of the present application, the position information of the sound emission object that emits the target audio frame can be acquired by using the sound source localization technology. For the description of the sound source localization technique, reference is made to the above description, and the description is not repeated here.
      As an example, the audio collector is a microphone array, and when the sound-emitting object emits an audio signal, the audio signal can be collected by the microphones in the microphone array. In order to process the audio signals in real time, the target audio signal collected by each microphone in the array may be divided into frames according to a preset duration once it has been acquired, and the currently obtained audio frame is then used as the target audio frame. Because the distances between the individual microphones in the array and the sound-emitting object generally differ, the times at which the microphones receive the target audio frame also differ, and the target position information of the sound-emitting object that emitted the target audio frame can be calculated from the differences between the times at which the microphones receive the corresponding target audio frame.
      The specific implementation of step 220 is described below.
      In the embodiment of the present application, when the sound-emitting object switches from sound-emitting object A to sound-emitting object B, the position information of sound-emitting object A differs from that of sound-emitting object B because the two are located at different positions. In order to accurately determine whether the sound-emitting object has switched, it is necessary to determine whether the target position information of the second sound-emitting object, which emitted the target audio frame, matches the position information of the first sound-emitting object, which emitted the audio frame preceding the target audio frame.
      Since the audio frame emitted by the first sound emission object emitting the previous audio frame of the target audio frame may include multiple frames, in order to improve the accuracy of determining whether the sound emission object is switched, the first position information corresponding to the first target audio segment including the previous N audio frames of the target audio frame may be matched with the target position information.
      The first target audio segment comprises the N audio frames preceding the target audio frame, and every audio frame in the first target audio segment has the same sound-emitting object, namely the first sound-emitting object that emitted the audio frame immediately preceding the target audio frame.
      It should be noted that, if the number of audio frames emitted by the first sound-emitting object acquired before the target audio frame is less than N, all consecutive audio frames emitted by the first sound-emitting object are taken as the first target audio segment, and the end point of the first target audio segment is the frame before the target audio frame.
      In some embodiments, the first position information corresponding to the first target audio segment is determined based on position information of a sound-generating object corresponding to an audio frame in the first target audio segment. For example, the first position information is determined based on an average of position information of the sound-generating object corresponding to each audio frame in the first target audio segment.
      For example, the position information of the sound object is an angle between the sound object and a preset microphone array. Then, the first position information is a first included angle corresponding to the first target audio segment, and the first included angle is an average value of included angles between a sound production object corresponding to each audio frame in the first target audio segment and the preset microphone array.
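      As an illustration of the averaging just described, the sketch below computes the first included angle from the angles of up to N frames preceding the target frame. It assumes 0-based frame indices and a plain arithmetic mean; a plain mean does not handle angles that wrap around (for example values near 0 and 360 degrees), which a real system would need to consider. The function name `first_position_info` is hypothetical.

```python
def first_position_info(angles_deg, target_index, window_size):
    """Average the angles of up to `window_size` frames before `target_index`.

    `angles_deg[i]` is the angle measured for frame i (0-based). If fewer
    than `window_size` frames precede the target, all of them are used.
    """
    start = max(0, target_index - window_size)
    window = angles_deg[start:target_index]
    if not window:
        raise ValueError("no audio frame precedes the target frame")
    return sum(window) / len(window)
```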
      In the embodiment of the present application, whether the target location information matches the first location information may be determined by whether a difference between the target location information and the first location information is within a preset value range. If the difference value is within the preset value range, the target position information is matched with the first position information, and if the difference value exceeds the preset value range, the target position information is not matched with the first position information.
      The matching degree between the target position information and the first position information can be represented by the difference value between the target position information and the first position information. That is, the matching degree between the target position information and the first position information is the difference between the target position information and the first position information.
      In other embodiments of the present application, the degree of match between the target location information and the first location information may be characterized by a ratio of a difference between the target location information and the first location information to the first location information. And if the ratio of the difference value of the target position information and the first position information to the first position information is within a preset ratio range, representing that the target position information is matched with the first position information. And if the ratio of the difference value of the target position information and the first position information to the first position information is not in the preset ratio range, indicating that the target position information is not matched with the first position information.
      The specific implementation manner of determining whether the target location information matches the first location information is not limited.
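      The two matching criteria described above, by difference and by ratio, can be written as two small predicates. This is only a sketch: the preset ranges are modeled as single upper bounds on the absolute difference and on the relative difference, and the function names are hypothetical.

```python
def matches_by_difference(target, first, max_difference):
    """Match when the absolute difference stays within the preset range."""
    return abs(target - first) <= max_difference


def matches_by_ratio(target, first, max_ratio):
    """Match when the difference relative to `first` stays within the preset ratio."""
    if first == 0:
        raise ValueError("ratio is undefined when the first position information is 0")
    return abs(target - first) / abs(first) <= max_ratio
```

      The ratio form scales the tolerance with the magnitude of the first position information, so it is undefined when that value is 0; the difference form has no such restriction.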
      In the embodiment of the present application, a mismatch between the target position information and the first position information indicates that the second sound-emitting object, which emitted the target audio frame, is different from the sound-emitting object of the first target audio segment, i.e. that the sound-emitting object may have switched. In that case, the sound-emitting object of the first target audio segment may be further determined by using the voiceprint recognition technology.
      Since the first target audio segment may only be a part of the audio signal emitted by the first sound-producing object (i.e. the sound-producing object of the audio frame preceding the target audio frame), when performing the voice stream role separation, the role separation of the second target audio segment emitted by the first sound-producing object is required.
      As one example, the second target audio segment includes the first target audio segment, and the voicing object of the audio frame in the second target audio segment is the same as the voicing object of the audio frame in the first target audio segment. That is, the sound generation object of each audio frame in the second target audio segment is the same as the sound generation object of each audio frame in the first target audio segment, i.e., the first sound generation object.
      In some embodiments, whether the sound-emitting object of the target audio frame and the sound-emitting object of the first target audio segment may be the same is determined by determining whether the target position information matches the first position information. If the target position information does not match the first position information, the target audio frame may be regarded as the starting point of the content uttered by the second sound-emitting object, i.e. the first audio frame it emits, and the audio frame preceding the target audio frame may be regarded as the end point of the content uttered by the sound-emitting object of the first target audio segment, i.e. the last audio frame emitted by the first sound-emitting object. By the same method, the starting point of the content uttered by the first sound-emitting object, i.e. the first audio frame emitted by the first sound-emitting object, will already have been obtained.
      If the target position information does not match the first position information, at least a portion of the consecutive audio frames between the first audio frame emitted by the first sound-emitting object and the audio frame preceding the target audio frame may be determined as the second target audio segment.
      For example, all audio frames from the first audio frame emitted by the sound-emitting object of the first target audio segment to the audio frame preceding the target audio frame may be determined as the second target audio segment. That is, the starting point and the end point of the determined second target audio segment are the points at which two successive position mismatches occur. If the position information of the sound-emitting object is the included angle between the sound-emitting object and the microphone array, the starting point and the end point of the second target audio segment are the points at which two successive angle changes occur.
      It should be noted that the first target audio segment and the second target audio segment each include a plurality of consecutive audio frames.
      When a change in the position information is detected, the audio segment between the points at which the previous change and the current change occur is determined as the second target audio segment, and the target voiceprint features of the second target audio segment are then extracted.
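      Extracting the second target audio segment between two change points amounts to a slice of the frame sequence. The sketch below assumes 0-based indices and takes all frames between the two points (the description also allows only a portion of them); the function name `second_target_segment` is hypothetical.

```python
def second_target_segment(frames, previous_change_index, current_change_index):
    """Return the frames between two successive position changes.

    `previous_change_index` is the 0-based index of the first frame emitted
    by the first sound-emitting object; `current_change_index` is the index
    of the target frame whose position no longer matches.
    """
    if not 0 <= previous_change_index < current_change_index <= len(frames):
        raise ValueError("change points must be ordered and inside the stream")
    return frames[previous_change_index:current_change_index]
```

      The target frame itself is excluded, since it is the first frame of the next sound-emitting object.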
      In this embodiment of the application, if the target position information matches the first position information, it represents that the second sound-generating object is a sound-generating object of the first target audio segment, that is, the sound-generating object corresponding to the first target audio segment is the same as the sound-generating object corresponding to the target audio frame, and then the next audio frame is re-acquired and is used as the target audio frame, and the target position information of the second sound-generating object that has generated the target audio frame is acquired, that is, the process returns to step 210.
      It should be noted that, after the next target audio frame is acquired, the first target audio segment in step 220 is updated so that it includes the previous target audio frame, and the first position information corresponding to the first target audio segment is updated accordingly.
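      One way to keep the first target audio segment and its first position information up to date is a fixed-size sliding window. The sketch below is an assumption about how this could be done, using angles as position information and a plain mean; the class name `FirstTargetSegment` is hypothetical.

```python
from collections import deque


class FirstTargetSegment:
    """Sliding window over the positions of the most recent N frames."""

    def __init__(self, window_size):
        self._angles = deque(maxlen=window_size)

    def add(self, angle_deg):
        """Append the latest matching frame; the oldest one drops out when full."""
        self._angles.append(angle_deg)

    def first_position(self):
        """Return the updated first position information (mean angle)."""
        if not self._angles:
            raise ValueError("the window is empty")
        return sum(self._angles) / len(self._angles)
```

      A `deque` with `maxlen` discards the oldest entry automatically when a new one is appended, so the window never holds more than N frames.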
      The specific implementation of step 230 is described below.
      In the embodiment of the present application, a sound database is pre-established, where the sound database includes a correspondence between a sound-generating object and a voiceprint feature, and a correspondence between the sound-generating object and an audio signal.
      After the target voiceprint feature of the second target audio segment is obtained, in order to determine the sound generating object of the second target audio segment, the target voiceprint feature needs to be matched with each voiceprint feature in the preset sound database, and the sound generating object corresponding to the voiceprint feature matched with the target voiceprint feature in the preset sound database is determined as the target sound generating object corresponding to the second target audio segment. Moreover, the second target audio segment may be determined as the audio signal corresponding to the target sound-emitting object, that is, the second target audio segment may be added as the audio signal corresponding to the target sound-emitting object.
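      The lookup in the preset sound database can be sketched as a nearest-match search. The description does not specify how voiceprint features are represented or compared; the sketch assumes fixed-length numeric vectors compared by cosine similarity, with one stored vector per sound-emitting object. The names `cosine_similarity` and `best_match` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(target_voiceprint, sound_database):
    """Return (speaker, score) for the stored voiceprint closest to the target.

    `sound_database` maps a speaker label to one stored voiceprint vector
    (an assumed layout) and must not be empty.
    """
    scored = ((cosine_similarity(target_voiceprint, stored), speaker)
              for speaker, stored in sound_database.items())
    score, speaker = max(scored)
    return speaker, score
```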
      In some embodiments of the present application, step 230 comprises: in the case that a first voiceprint feature whose matching degree with the target voiceprint feature satisfies a first preset matching condition exists in the preset sound database, determining the sound-emitting object corresponding to the first voiceprint feature as the target sound-emitting object corresponding to the second target audio segment; and in the case that the matching degree between the first voiceprint feature and the target voiceprint feature satisfies a second preset matching condition, updating the voiceprint feature corresponding to the target sound-emitting object in the preset sound database by using the target voiceprint feature.
      The matching degree required by the second preset matching condition is greater than the matching degree required by the first preset matching condition.
      In the embodiment of the application, only when the matching degree of the first voiceprint feature and the target voiceprint feature meets the second preset matching condition, the voiceprint feature corresponding to the target sound object in the preset sound database is updated by using the target voiceprint feature, so that the richness and the accuracy of the voiceprint feature corresponding to the target sound object can be improved, and the accuracy of voiceprint recognition is improved.
      As an example, the first preset matching condition is that the degree of matching between the voiceprint features in the preset sound database and the target voiceprint features is greater than 80%, and the second preset matching condition is that the degree of matching between the voiceprint features in the preset sound database and the target voiceprint features is greater than 90%.
      In the embodiment of the application, when the matching degree between the first voiceprint feature and the target voiceprint feature does not satisfy the second preset matching condition, the voiceprint feature corresponding to the target sound-emitting object in the preset sound database is not updated with the target voiceprint feature, and the second target audio segment is merely determined as an audio signal emitted by the target sound-emitting object.
      In an embodiment of the present application, if there are a plurality of first voiceprint features, the sound generation object corresponding to the first voiceprint feature with the highest matching degree with the target voiceprint feature is determined as the sound generation object corresponding to the second target audio segment.
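      The two preset matching conditions can be sketched as follows, using the 80% and 90% values from the example above as defaults. Two points are assumptions made for illustration: "updating" is modeled as replacing the stored voiceprint with the target voiceprint, and `similarity` is any caller-supplied function returning a matching degree between 0 and 1. The function name `identify_and_maybe_update` is hypothetical.

```python
def identify_and_maybe_update(target_voiceprint, sound_database, similarity,
                              identify_threshold=0.8, update_threshold=0.9):
    """Pick the best-matching speaker and refresh the stored voiceprint if very close.

    `similarity` is any function returning a matching degree in [0, 1].
    Returns the speaker label, or None when no stored voiceprint satisfies
    the first preset matching condition.
    """
    best_speaker, best_score = None, identify_threshold
    for speaker, stored in sound_database.items():
        score = similarity(target_voiceprint, stored)
        if score > best_score:  # first condition, keeping the highest match
            best_speaker, best_score = speaker, score
    if best_speaker is not None and best_score > update_threshold:
        # Second, stricter condition: replace the stored voiceprint (assumed policy).
        sound_database[best_speaker] = target_voiceprint
    return best_speaker
```

      Gating the update on the stricter condition keeps a borderline identification from overwriting a stored voiceprint with features that may belong to someone else.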
      In some embodiments of the present application, in the case that the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the first preset matching condition includes that the matching degree between a voiceprint feature in the preset sound database and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, and the second preset matching condition includes that the matching degree between a voiceprint feature in the preset sound database and the target voiceprint feature is larger than a second preset voiceprint matching degree threshold, wherein the second preset voiceprint matching degree threshold is larger than the first preset voiceprint matching degree threshold.
      Under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the first preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a third preset voiceprint matching degree threshold value; the second preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a fourth preset voiceprint matching degree threshold.
      The fourth preset voiceprint matching degree threshold is larger than the third preset voiceprint matching degree threshold, the first preset voiceprint matching degree threshold is smaller than the second preset voiceprint matching degree threshold, the second preset voiceprint matching degree threshold is smaller than the fourth preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold.
      In the embodiment of the application, a matching degree between the target location information and the first location information that is smaller than the first preset location matching degree threshold and larger than the second preset location matching degree threshold indicates that the target location information does not match the first location information but is relatively close to it. A matching degree smaller than the second preset location matching degree threshold indicates that the target location information does not match the first location information and differs from it substantially.
      When the target position information is relatively close to the first position information, the second sound-emitting object and the sound-emitting object of the first target audio segment may be the same, that is, the sound-emitting object of the target audio frame and the sound-emitting object of the second target audio segment may be the same person, so the thresholds for voiceprint matching may be set slightly lower. When the target position information differs substantially from the first position information, the second sound-emitting object and the sound-emitting object of the first target audio segment may be different, that is, the sound-emitting object of the target audio frame and the sound-emitting object of the second target audio segment may not be the same person, so the thresholds for voiceprint matching may be set slightly higher. That is, the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold, and the second preset voiceprint matching degree threshold is smaller than the fourth preset voiceprint matching degree threshold.
      In some embodiments of the present application, a method for determining an utterance object provided by an embodiment of the present application further includes: under the condition that the matching degree of each voiceprint feature and the target voiceprint feature in the preset sound database meets a third preset matching condition, storing the corresponding relation between the target voiceprint feature and the sound object corresponding to the target voiceprint feature in the preset sound database, and determining the sound object corresponding to the target voiceprint feature as the target sound object of the second target audio segment; and the third preset matching condition is used for representing that the voiceprint features in the preset sound database are not matched with the target voiceprint features.
      In an embodiment of the present application, when each voiceprint feature in the preset sound database does not match the target voiceprint feature, it indicates that the sound generation object corresponding to each voiceprint feature in the preset sound database is not the sound generation object corresponding to the second target audio segment.
      In some embodiments, the sound-generating object corresponding to the target voiceprint feature may be determined from a pre-established correspondence relationship between the voiceprint feature and the sound-generating object. And then updating the target voiceprint characteristics and the corresponding relation between the sound production objects corresponding to the target voiceprint characteristics to a preset sound database. That is, the sound-generating object corresponding to the target voiceprint feature is registered in the preset sound database.
      In other embodiments of the present application, a method for determining an utterance object provided by an embodiment of the present application further includes: discarding the second target audio segment in a case where the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature satisfies a fourth preset matching condition but does not satisfy the third preset matching condition.
      The fourth preset matching condition is also used for representing that the voiceprint features in the preset sound database are not matched with the target voiceprint features, but the degree of mismatching which needs to be met corresponding to the fourth preset matching condition is smaller than the degree of mismatching which needs to be met corresponding to the third preset matching condition.
      That is, when the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature satisfies the fourth preset matching condition but does not satisfy the third preset matching condition, in order to improve the accuracy of subsequent determination of the sound object, the target voiceprint feature of the second target audio segment and the sound object corresponding to the target voiceprint feature are not registered.
      In some embodiments of the present application, in a case that a matching degree between the target location information and the first location information is smaller than a first preset location matching degree threshold and larger than a second preset location matching degree threshold, the third preset matching condition includes that a matching degree of a voiceprint feature in the preset sound database and the target voiceprint feature is smaller than a fifth preset voiceprint matching degree threshold.
      And under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the third preset matching condition comprises that the matching degree between the voiceprint features in the preset sound database and the target voiceprint features is smaller than a sixth preset voiceprint matching degree threshold value.
      And the fifth preset voiceprint matching degree threshold is smaller than the sixth preset voiceprint matching degree threshold.
      In some embodiments of the present application, in a case that the matching degree between the target location information and the first location information is smaller than a first preset location matching degree threshold and larger than a second preset location matching degree threshold, the fourth preset matching condition is that the matching degree between the voiceprint feature in the preset sound database and the target voiceprint feature is smaller than a seventh preset voiceprint matching degree threshold.
      And under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the fourth preset matching condition is that the matching degree between the voiceprint features in the preset sound database and the target voiceprint features is smaller than an eighth preset voiceprint matching degree threshold value.
      And the seventh preset voiceprint matching degree threshold is smaller than the eighth preset voiceprint matching degree threshold.
      When the target position information is close to the first position information, the second sound-generating object and the sound-generating object of the first target audio segment may be the same, that is, the sound-generating object of the target audio frame and the sound-generating object of the second target audio segment may be the same person, so the mismatch thresholds for voiceprint matching may be set slightly lower. When the target position information differs substantially from the first position information, the second sound-generating object and the sound-generating object of the first target audio segment may be different, that is, the sound-generating object of the target audio frame and the sound-generating object of the second target audio segment may not be the same person, so the mismatch thresholds for voiceprint matching may be set slightly higher. That is, the fifth preset voiceprint matching degree threshold is smaller than the sixth preset voiceprint matching degree threshold, and the seventh preset voiceprint matching degree threshold is smaller than the eighth preset voiceprint matching degree threshold. With this arrangement, the accuracy of determining the sound-generating object can be improved.
      That is, when the target position information does not match the first position information, two degrees of mismatch are distinguished. A matching degree between the target position information and the first position information that is smaller than the first preset position matching degree threshold and larger than the second preset position matching degree threshold represents a slightly lower degree of mismatch, whereas a matching degree that is smaller than the second preset position matching degree threshold represents a slightly higher degree of mismatch.
      For each of the two cases, in which the degree of mismatch between the target position information and the first position information is slightly lower or slightly higher, four preset voiceprint matching degree thresholds are used for voiceprint matching. When the degree of mismatch is slightly lower, the target position information is close to the first position information, and the four corresponding thresholds are the first, second, fifth and seventh preset voiceprint matching degree thresholds. When the degree of mismatch is slightly higher, the target position information differs substantially from the first position information, and the four corresponding thresholds are the third, fourth, sixth and eighth preset voiceprint matching degree thresholds. Each of the four thresholds used when the degree of mismatch is slightly lower is smaller than the corresponding threshold used when the degree of mismatch is slightly higher. That is, the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold, the second preset voiceprint matching degree threshold is smaller than the fourth preset voiceprint matching degree threshold, the fifth preset voiceprint matching degree threshold is smaller than the sixth preset voiceprint matching degree threshold, and the seventh preset voiceprint matching degree threshold is smaller than the eighth preset voiceprint matching degree threshold.
      It should be noted that the seventh preset voiceprint matching degree threshold is smaller than the first preset voiceprint matching degree threshold, and the eighth preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold.
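      The relationship between the two location matching degree thresholds and the two sets of four voiceprint thresholds can be sketched as follows. This is only an illustrative sketch: all numeric values, the names VoiceprintThresholds, NEAR, FAR, select_thresholds and orderings_hold, and the handling of a matching degree exactly equal to a threshold are assumptions and are not specified in the present application.

```python
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceprintThresholds:
    register_below: float  # fifth (near) or sixth (far) threshold
    discard_below: float   # seventh (near) or eighth (far) threshold
    assign_above: float    # first (near) or third (far) threshold
    update_above: float    # second (near) or fourth (far) threshold


# Hypothetical values chosen only to satisfy the orderings described above.
NEAR = VoiceprintThresholds(0.20, 0.35, 0.60, 0.75)  # 5th, 7th, 1st, 2nd
FAR = VoiceprintThresholds(0.30, 0.45, 0.70, 0.85)   # 6th, 8th, 3rd, 4th


def select_thresholds(
    position_match: float,
    first_position_threshold: float = 0.8,   # assumed value
    second_position_threshold: float = 0.5,  # assumed value
) -> Optional[VoiceprintThresholds]:
    """Pick the voiceprint threshold set from the location matching degree."""
    if position_match >= first_position_threshold:
        # Positions match: no switch is detected, so no set is selected here.
        return None
    if position_match > second_position_threshold:
        return NEAR  # slightly lower degree of mismatch
    return FAR       # slightly higher degree of mismatch (equality: assumption)


def orderings_hold(near: VoiceprintThresholds, far: VoiceprintThresholds) -> bool:
    """Check the orderings: within each set, and near below far pairwise."""
    def ascending(t: VoiceprintThresholds) -> bool:
        return t.register_below < t.discard_below < t.assign_above < t.update_above

    pairwise = all(n < f for n, f in zip(astuple(near), astuple(far)))
    return ascending(near) and ascending(far) and pairwise
```

      In this sketch the orderings inside each set follow from the description (for example, the seventh threshold lies between the fifth and the first), and orderings_hold can be used to reject a configuration that violates them.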
      Fig. 3 shows a schematic flow chart of voiceprint matching provided in the embodiment of the present application. For example, the position information of the sound-generating object is an angle with the microphone array, and the target position information is a target angle between the second sound-generating object and the microphone array. The first position information is a first angle corresponding to the first target audio segment.
      Referring to fig. 3, when the matching degree between the target angle and the first angle is smaller than the first preset position matching degree threshold and larger than the second preset position matching degree threshold, that is, when the target angle is close to the first angle, the voiceprint comparison is performed using the lower set of preset voiceprint matching degree thresholds.
      That is to say, when the preset sound database contains a first voiceprint feature whose matching degree with the target voiceprint feature is greater than the second preset voiceprint matching degree threshold, it is judged very probable that the sound object corresponding to the first voiceprint feature is the same as the sound object of the second target audio segment, and the voiceprint feature corresponding to the target sound object is updated with the target voiceprint feature.
      When the preset sound database contains a first voiceprint feature whose matching degree with the target voiceprint feature is greater than the first preset voiceprint matching degree threshold and smaller than the second preset voiceprint matching degree threshold, it is judged that the probability that the sound object corresponding to the first voiceprint feature is the same as the sound object of the second target audio segment is relatively high, but lower than in the case where the matching degree is greater than the second preset voiceprint matching degree threshold; therefore, the voiceprint feature corresponding to the target sound object is not updated with the target voiceprint feature.
      If the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature is smaller than the fifth preset voiceprint matching degree threshold, it is judged very probable that none of the sound-producing objects corresponding to the voiceprint features in the preset sound database is the sound-producing object corresponding to the target voiceprint feature, and the correspondence between the target voiceprint feature and the sound-producing object corresponding to the target voiceprint feature is stored in the preset sound database, that is, the target voiceprint feature and its corresponding sound-producing object are registered.
      If the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature is smaller than the seventh preset voiceprint matching degree threshold and larger than the fifth preset voiceprint matching degree threshold, it is judged that the probability that none of the sound-producing objects corresponding to the voiceprint features in the preset sound database is the sound-producing object corresponding to the target voiceprint feature is relatively high, but lower than in the case where the matching degree is smaller than the fifth preset voiceprint matching degree threshold; the second target audio segment is discarded, and the target voiceprint feature and its corresponding sound-producing object are not registered.
      Continuing with fig. 3, when the matching degree between the target angle and the first angle is smaller than the second preset position matching degree threshold value, that is, the target angle is different from the first angle, a higher preset voiceprint matching degree threshold value is used for voiceprint comparison.
      That is to say, when the preset sound database contains a first voiceprint feature whose matching degree with the target voiceprint feature is greater than the fourth preset voiceprint matching degree threshold, it is judged very probable that the sound object corresponding to the first voiceprint feature is the same as the sound object of the second target audio segment, and the voiceprint feature corresponding to the target sound object is updated with the target voiceprint feature.
      When the preset sound database contains a first voiceprint feature whose matching degree with the target voiceprint feature is greater than the third preset voiceprint matching degree threshold and smaller than the fourth preset voiceprint matching degree threshold, it is judged that the probability that the sound object corresponding to the first voiceprint feature is the same as the sound object of the second target audio segment is relatively high, but lower than in the case where the matching degree is greater than the fourth preset voiceprint matching degree threshold; therefore, the voiceprint feature corresponding to the target sound object is not updated with the target voiceprint feature.
      If the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature is smaller than the sixth preset voiceprint matching degree threshold, it is judged very probable that none of the sound objects corresponding to the voiceprint features in the preset sound database is the sound object corresponding to the target voiceprint feature, and the correspondence between the target voiceprint feature and the sound object corresponding to the target voiceprint feature is stored in the preset sound database, that is, the target voiceprint feature and its corresponding sound object are registered.
      If the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature is smaller than the eighth preset voiceprint matching degree threshold and larger than the sixth preset voiceprint matching degree threshold, it is judged that the probability that none of the sound objects corresponding to the voiceprint features in the preset sound database is the sound object corresponding to the target voiceprint feature is relatively high, but lower than in the case where the matching degree is smaller than the sixth preset voiceprint matching degree threshold; the second target audio segment is discarded, and the target voiceprint feature and its corresponding sound object are not registered.
      In the embodiment of the application, the matching degree between the target position information and the first position information is used to select one of two sets of preset voiceprint matching degree thresholds for voiceprint matching, so that the accuracy of determining the sound object can be improved.
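      The four outcomes of the voiceprint comparison in fig. 3 can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the names Decision and decide, the representation of the preset sound database as a dictionary of matching degrees, the treatment of an empty database, and the UNSPECIFIED result for matching degrees lying between the discard threshold and the assignment threshold are all hypothetical, since the description does not address those cases.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Decision(Enum):
    ASSIGN_AND_UPDATE = "assign segment and update stored voiceprint"
    ASSIGN_ONLY = "assign segment without updating stored voiceprint"
    REGISTER_NEW = "register target voiceprint as a new sound object"
    DISCARD = "discard the second target audio segment"
    UNSPECIFIED = "range not covered by the description"


def decide(
    scores: Dict[str, float],  # sound object id -> matching degree with target
    register_below: float,     # fifth or sixth threshold
    discard_below: float,      # seventh or eighth threshold
    assign_above: float,       # first or third threshold
    update_above: float,       # second or fourth threshold
) -> Tuple[Decision, Optional[str]]:
    if not scores:
        # Assumption: with an empty database nothing can match, so register.
        return Decision.REGISTER_NEW, None
    # With several candidates, keep the one with the highest matching degree.
    best_id = max(scores, key=scores.get)
    best = scores[best_id]
    if best > update_above:
        return Decision.ASSIGN_AND_UPDATE, best_id
    if best > assign_above:
        return Decision.ASSIGN_ONLY, best_id
    # If the best score is below a bound, every stored voiceprint is below it.
    if best < register_below:
        return Decision.REGISTER_NEW, None
    if best < discard_below:
        return Decision.DISCARD, None
    return Decision.UNSPECIFIED, None
```

      For example, with the hypothetical lower threshold set (0.20, 0.35, 0.60, 0.75), a database whose best matching degree is 0.8 yields ASSIGN_AND_UPDATE, while a best matching degree of 0.3 yields DISCARD.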
      In some embodiments of the present application, in order to improve the accuracy of determining the sound generating object, before step 220, the sound generating object determining method provided by an embodiment of the present application further includes: filtering the position information of a sound production object corresponding to the audio frame in the first target audio segment to obtain filtered position information; based on the filtered location information, first location information is determined.
      In some embodiments, the position information of the sounding object corresponding to each audio frame in the first target audio segment may be filtered by a median filter, so as to obtain filtered position information of the sounding object corresponding to each audio frame.
      The idea of median filtering is that the position information of the sound-generating object corresponding to each audio frame can be replaced by a statistical median of the position information of the sound-generating objects corresponding to all audio frames in a neighborhood of a preset size of the audio frame.
      As an example, if the position information of the sound generating object is an angle between the sound source and the microphone array, the first position information is an average value of the filtered angles of the sound generating objects corresponding to each audio frame.
      In the embodiment of the application, the position information of the sound-producing object corresponding to each audio frame in the first target audio segment is filtered, so that some noises and burrs can be filtered, the position information is more stable, and the smooth position information is obtained, thereby improving the accuracy of determining the sound-producing object.
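      The median filtering of per-frame angles followed by averaging can be sketched as follows. The function names median_filter and first_position, the window size of 3, and the edge handling (repeating the first and last values) are assumptions made for illustration; the sketch also ignores angle wrap-around at 0 and 360 degrees.

```python
from statistics import mean, median
from typing import List, Sequence


def median_filter(values: Sequence[float], window: int = 3) -> List[float]:
    """Replace each value by the median of its neighborhood of size `window`."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not values:
        return []
    half = window // 2
    # Assumption: pad both ends by repeating the edge values.
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(values))]


def first_position(angles: Sequence[float], window: int = 3) -> float:
    """First position information: mean of the filtered per-frame angles."""
    return mean(median_filter(angles, window))


# A single outlier frame (90 degrees) is removed before averaging.
angles = [30, 31, 90, 29, 30]
filtered = median_filter(angles)   # [30, 31, 31, 30, 30]
position = first_position(angles)  # 30.4
```

      Without the filter, the mean of the raw angles in this example would be 42, so the single spurious frame would shift the first position information noticeably.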
      In other embodiments of the present application, the position information of the sound-generating object corresponding to each audio frame in the first target audio segment may be filtered in other manners, for example, an averaging filter may be used.
      Fig. 4 is a flowchart illustrating a method for determining an utterance object according to a second aspect of the present application. As shown in fig. 4, the second aspect of the present application provides a sound emission target determination method 400 including:
       step 410, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
       step 420, if it is determined that the target position information does not match the first position information corresponding to the first target audio segment, extracting a target voiceprint feature of the first target audio segment, where the first target audio segment includes all consecutive audio frames emitted by the first sound-emitting object, and the first sound-emitting object is the sound-emitting object of the audio frame preceding the target audio frame; the end point of the first target audio segment is the audio frame preceding the target audio frame;
       step 430, determining a target sound-generating object of the first target audio segment according to the target voiceprint characteristics.
      In the embodiment of the present application, the specific implementation manner of step 410 is similar to that of step 210, and is not described herein again.
      In the embodiment of the present application, the specific implementation manner of step 420 is similar to that of step 220. In step 420, the difference from step 220 is that the first target audio segment includes all consecutive audio frames emitted by the first sound-emitting object, and the first sound-emitting object is a sound-emitting object of an audio frame previous to the emitted target audio frame; the end point of the first target audio segment is an audio frame preceding the target audio frame.
      In step 220, by contrast, the first target audio segment includes the N audio frames emitted by the first sound-emitting object before the target audio frame, which are not necessarily all of the consecutive audio frames emitted by the first sound-emitting object.
      In the embodiment of the application, since the position information included in the first position information corresponding to all the continuous audio frames emitted by the first sound-emitting object is richer, the first position information can more accurately represent the position information of the first sound-emitting object. Therefore, the first position information is matched with the target position information, whether the sounding object is switched or not can be judged more accurately, and the accuracy of role separation can be improved.
      In the embodiment of the present application, a specific implementation manner of step 430 is similar to that of step 230, and the identity of the first sound-emitting object may be determined according to the target voiceprint feature of the first target audio segment, which is not described herein again.
      In an embodiment of the present invention, in a case where it is determined that the target position information of the second sound-emitting object that emits the target audio frame does not match the first position information corresponding to the first target audio segment, it may be determined that the first sound-emitting object that emits the first target audio segment and the second sound-emitting object that emits the target audio frame are different. The target voiceprint feature of the first target audio segment is then extracted, and the target sound-emitting object of the first target audio segment is determined according to the target voiceprint feature, that is, the identity of the first sound-emitting object is determined, so as to realize role separation. By combining the voiceprint recognition technology and the sound source positioning technology, the accuracy of determining the sound-emitting object, and therefore the accuracy of role separation, can be improved.
      In an embodiment of the present application, a specific implementation of the method for determining a sounding object provided in the second aspect is similar to that of the method for determining a sounding object provided in the first aspect, and is not described herein again.
      In the embodiment of the present application, if the role separation is to be implemented, it is necessary to determine the starting point and the ending point of the vocalized content of each vocalized object to separate the vocalized content of each vocalized object, so as to implement the role separation. Fig. 5 is a flowchart illustrating a method for determining a starting point of utterance content according to a third aspect of the present application. As shown in fig. 5, a method 500 for determining a starting point of utterance content provided in a third aspect of the present application includes:
       step 510, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
       step 520, determining that the target position information is not matched with the first position information corresponding to the first target audio segment, and determining the target audio frame as a starting point of the sound production content of the second sound production object;
      the first target audio segment comprises all continuous audio frames emitted by a first sound-emitting object, and the first sound-emitting object is a sound-emitting object of an audio frame before the target audio frame is emitted; the end point of the first target audio segment is an audio frame preceding the target audio frame.
      In the embodiment of the present application, the specific implementation manner of step 510 is similar to that of step 210, and is not described herein again.
      In step 520, in the case that it is determined that the target position information does not match the first position information corresponding to the first target audio segment, it is determined that the sound generating object has switched, that is, the sound generating object has switched from the first sound generating object to the second sound generating object. Therefore, the audio frame preceding the target audio frame may be used as the end point of the utterance content of the first utterance object, and the target audio frame may be used as the start point of the utterance content of the second utterance object, so that the entire utterance content of the second utterance object can be extracted subsequently.
      In the embodiment of the application, whether the sound-producing object is switched or not can be determined by matching the target position information of the sound-producing object of the target audio frame with the first position information corresponding to all the continuous audio frames of the first sound-producing object, so that the starting point and the end point of the sound-producing content of each sound-producing object can be determined, and the role separation is realized.
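      The determination of start points and end points from position information alone can be sketched as follows. This is a simplified sketch, not the implementation of the present application: position information is assumed to be an angle in degrees, a "match" is assumed to mean that the absolute difference from the mean angle of the current run of frames is within a hypothetical tolerance, and the name split_by_position is illustrative.

```python
from typing import List, Sequence, Tuple


def split_by_position(
    angles: Sequence[float], tolerance: float = 15.0
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, inclusive, of each run of frames."""
    segments: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(angles)):
        # First position information: mean angle of all consecutive frames
        # emitted so far by the current sound-emitting object.
        run = angles[start:i]
        first_position = sum(run) / len(run)
        if abs(angles[i] - first_position) > tolerance:
            # Mismatch: frame i - 1 ends the previous content and
            # frame i starts the content of the next sound-emitting object.
            segments.append((start, i - 1))
            start = i
    if angles:
        segments.append((start, len(angles) - 1))
    return segments


# Three runs: frames 0-2 near 31 degrees, 3-5 near 90 degrees, frame 6 near 29.
print(split_by_position([30, 32, 31, 90, 88, 91, 29]))
# prints [(0, 2), (3, 5), (6, 6)]
```

      Each returned pair marks the start point and end point of one stretch of utterance content; identifying which sound-emitting object produced each stretch would still rely on voiceprint matching.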
      In some scenarios, different sound objects may sound, for example, in a meeting room scenario, different participants may speak. In order to improve the efficiency of the conference, the current sound-emitting object can be prompted so that other users can know the identity of the current sound-emitting object. Therefore, it is desirable to provide a method for changing the identification of a sound object to prompt the identity of the current sound object. Fig. 6 is a flowchart illustrating a method for changing an identification of a sound-generating object according to a fourth aspect of the present application. As shown in fig. 6, a sound emission target identification changing method 600 according to a fourth aspect of the present application includes:
       step 610, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
       step 620, if it is determined that the target position information does not match the first position information corresponding to the first target audio segment, extracting a target voiceprint feature of a second target audio segment, wherein the first target audio segment includes the N audio frames preceding the target audio frame; the sound production object of the audio frames in the first target audio segment is the same as the sound production object of the audio frames in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
       step 630, determining a target sound-generating object of the second target audio segment according to the target voiceprint characteristics;
       step 640, changing the identification of the second sound-emitting object and the identification of the target sound-emitting object, wherein the identifications are used for representing the sound-emitting state of the sound-emitting objects.
      In the embodiment of the present application, a specific implementation manner of step 610 is similar to a specific implementation manner of step 210, a specific implementation manner of step 620 is similar to a specific implementation manner of step 220, and a specific implementation manner of step 630 is similar to a specific implementation manner of step 230, and is not described herein again.
      In an embodiment of the present application, in a case where it is determined that the target position information does not match the first position information corresponding to the first target audio segment, it may be determined that the sound-generating object has switched. That is, the sound-generating object has switched from the target sound-generating object of the second target audio segment to the second sound-generating object of the target audio frame.
      Thus, the identity of the second sound-generating object and the identity of the target sound-generating object may be altered to prompt the sound-generating object to switch from the target sound-generating object to the second sound-generating object. The identification of the sound-emitting object is used for representing the sound-emitting state of the sound-emitting object.
      As one example, the identification of the sound-emitting object may be the brightness of the image of the sound-emitting object. For example, when the brightness of the image of the sound-emitting object is the first preset brightness, this identifies that the sound-emitting object is currently in a sound-emitting state. If the brightness of the image of the sound-emitting object is the second preset brightness, this identifies that the sound-emitting object is currently in a non-sound-emitting state.
      In the embodiment of the application, if the current sound-generating object is switched from the target sound-generating object to the second sound-generating object, the brightness of the image of the target sound-generating object is changed from the first preset brightness to the second preset brightness, so as to represent that the target sound-generating object stops generating sound. The brightness of the image of the second sound-producing object is changed from the second preset brightness to the first preset brightness, and the brightness is used for representing that the second sound-producing object starts to produce sound.
      In other embodiments of the present application, the identification of the sound-emitting object may be a label of the sound-emitting object. For example, when the label of the sound-emitting object is the first preset label, this identifies that the sound-emitting object is currently in a sound-emitting state. If the label of the sound-emitting object is the second preset label, this identifies that the sound-emitting object is currently in a non-sound-emitting state.
      In the embodiment of the application, if the current sound-emitting object is switched from the target sound-emitting object to the second sound-emitting object, the label of the target sound-emitting object is changed from the first preset label to the second preset label, so as to represent that the target sound-emitting object stops emitting sound. The label of the second sound-emitting object is changed from the second preset label to the first preset label, so as to represent that the second sound-emitting object starts to emit sound.
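      The label change described above can be illustrated with a minimal sketch. The names `SpeakerIndicator`, `switch_speaker`, and the label values `"speaking"` and `"silent"` are hypothetical stand-ins for the first and second preset labels; the application does not prescribe any particular data structure or label text.

```python
from dataclasses import dataclass

# Hypothetical label values standing in for the "first preset label"
# (sound-emitting state) and the "second preset label" (non-sound-emitting state).
SPEAKING = "speaking"
SILENT = "silent"


@dataclass
class SpeakerIndicator:
    """Identification shown for one sound-emitting object (illustrative only)."""
    name: str
    label: str = SILENT


def switch_speaker(previous: SpeakerIndicator, current: SpeakerIndicator) -> None:
    """Mark `previous` as having stopped and `current` as having started."""
    previous.label = SILENT    # first preset label -> second preset label
    current.label = SPEAKING   # second preset label -> first preset label


# The target sound-emitting object was speaking; the second one takes over.
target_object = SpeakerIndicator("participant A", SPEAKING)
second_object = SpeakerIndicator("participant B")
switch_speaker(target_object, second_object)
```

      The same pattern would apply to the brightness example by storing a brightness value in place of the label.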
      In an embodiment of the present application, by using the target position information of the sound-generating object of the target audio frame to match with the first position information corresponding to the first target audio segment, it may be determined whether the second sound-generating object is the same as the sound-generating object of the second target audio segment, i.e., whether a switch occurs in the current sound-generating object. And under the condition that the switching of the sound-emitting objects is determined, the identification of the second sound-emitting object and the identification of the target sound-emitting object are changed, so that the identity of the current sound-emitting object can be prompted.
      In some conversation scenarios, after acquiring the audio conversation data in the conversation scenario, the audio conversation data needs to be processed to obtain a conversation record, so as to record the content of the conversation. Accordingly, the present application provides a session record generation method. Fig. 7 is a flowchart illustrating a session record generation method according to a fifth aspect of the present application. As shown in fig. 7, a session record generating method 700 provided by the fifth aspect of the present application includes:
       step 710, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object in the audio session data;
       step 720, if the target position information is determined not to match the first position information corresponding to the first target audio segment in the audio session data, extracting the target voiceprint characteristics of the second target audio segment in the audio session data, wherein the first target audio segment comprises the first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
       step 730, determining a target sound production object of a second target audio segment according to the target voiceprint characteristics;
       step 740, associating the target sound-producing object with the text content corresponding to the second target audio segment to obtain a conversation record of the target sound-producing object.
      In the embodiment of the present application, a specific implementation manner of step 710 is similar to that of step 210, a specific implementation manner of step 720 is similar to that of step 220, and a specific implementation manner of step 730 is similar to that of step 230, and therefore, details are not described herein again.
      It should be noted that the audio collector may be used to collect audio session data in the session scene. In order to facilitate generation of a conversation record, in the process of collecting audio conversation data, position information of a sound emission object of each audio frame in the audio conversation data is also acquired.
      In an embodiment of the application, when it is determined that the target position information does not match the first position information corresponding to the first target audio segment in the audio session data, it is determined that the utterance object of the target audio frame is different from the utterance object of the first target audio segment, and the utterance content of the utterance object of the first target audio segment needs to be extracted. Due to the switching of the utterance object, the target audio frame may be used as a start point of the utterance content of the second utterance object, and a frame preceding the target audio frame may be used as an end point of the utterance content of the utterance object of the first target audio segment. With regard to the relationship between the first target audio segment and the second target audio segment, reference may be made to the statements of embodiments of the utterance object determination method provided in the first aspect.
      After the target sound production object of the second target audio segment is determined based on the target voiceprint characteristics, the text content corresponding to the second target audio segment is associated with the target sound production object, thereby obtaining the session record of the target sound production object.
      In the embodiment of the application, the second target audio segment corresponding to each sound-emitting object in the audio session data can be extracted by the method, so that the session record of the audio session data can be obtained.
      To improve the completeness of the session record, the second target audio segment may include all consecutive audio frames emitted by the first sound-emitting object, where the first sound-emitting object is the sound-emitting object of the audio frame immediately preceding the target audio frame.
      In the embodiment of the application, when it is determined that the target position information does not match the first position information corresponding to the first target audio segment in the audio session data, it is determined that the sound generating object of the target audio frame is different from the sound generating object of the first target audio segment. The text content corresponding to the second target audio segment, which has the same sound generating object as the first target audio segment, can therefore be associated with the target sound generating object to form a session record, which makes it more convenient to extract the record of the audio session data subsequently.
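      The association in step 740 can be sketched as follows. This is only an illustration under stated assumptions: the class `SessionRecord` and its method `associate` are hypothetical, and the text content is assumed to have already been produced from the second target audio segment by some speech recognition step that is outside this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionRecord:
    """Maps each identified sound-emitting object to its text content (illustrative)."""
    entries: Dict[str, List[str]] = field(default_factory=dict)

    def associate(self, speaker_id: str, text: str) -> None:
        """Attach the text of one second target audio segment to its speaker."""
        self.entries.setdefault(speaker_id, []).append(text)


record = SessionRecord()
# Each call corresponds to one second target audio segment whose target
# sound-emitting object has been determined from its voiceprint features.
record.associate("participant A", "Let us begin the meeting.")
record.associate("participant B", "I have one item to raise.")
record.associate("participant A", "Please go ahead.")
```

      Keeping the entries per speaker in arrival order is a design choice of this sketch; an implementation could equally store a single chronological list of (speaker, text) pairs.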
      In an embodiment of the present application, an execution subject of the sound emission target determination method provided in the embodiment of the present application may be a sound emission target determination device. In the embodiments of the present application, the sound emission target determination device is described by taking, as an example, the case where the sound emission target determination device executes the sound emission target determination method.
      Fig. 8 is a schematic structural diagram of the sound emission target determination apparatus provided in the sixth aspect. As shown in fig. 8, the sound emission target determination device 800 includes:
      an obtaining module 810, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
      an extraction module 820, configured to extract a target voiceprint feature of a second target audio segment if it is determined that the target position information does not match first position information corresponding to a first target audio segment, where the first target audio segment includes first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment; n is an integer greater than or equal to 1;
      the first determining module 830 is configured to determine a target sound-generating object of the second target audio segment according to the target voiceprint feature.
      According to the embodiment of the present invention, in the case where it is determined that the target position information of the second sound-emitting object from which the target audio frame is emitted does not match the first position information corresponding to the first target audio segment, it may be determined that the first sound-emitting object from which the first target audio segment is emitted is different from the second sound-emitting object from which the target audio frame is emitted. Then, the target voiceprint characteristic of the second target audio segment emitted by the first sound-emitting object is extracted, and the target sound-emitting object of the second target audio segment is determined according to the target voiceprint characteristic; that is, the identity of the first sound-emitting object is determined so as to realize role separation. By combining the voiceprint recognition technology and the sound source positioning technology, the accuracy of determining the sound-emitting object can be improved.
      In some embodiments of the present application, the target position information includes relative position information between the second sound-emitting object and a preset audio collector.
      In some embodiments of the present application, the sound emission target determination device 800 further includes:
      the filtering module is used for filtering the position information of the sounding object corresponding to the audio frame in the first target audio segment to obtain the filtered position information;
      a second determination module to determine the first location information based on the filtered location information.
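      The two modules above can be illustrated with a short sketch. The application does not state which filter is used, so the choices here are assumptions made purely for illustration: position information is represented as an azimuth angle in degrees, outliers are rejected by their distance from the median, and the first position information is taken as the mean of the remaining values. The names `filter_positions`, `first_position`, and `max_deviation` are hypothetical.

```python
from statistics import mean, median
from typing import List


def filter_positions(azimuths: List[float], max_deviation: float = 15.0) -> List[float]:
    """Drop per-frame azimuths that lie far from the median (assumed filter)."""
    centre = median(azimuths)
    return [a for a in azimuths if abs(a - centre) <= max_deviation]


def first_position(filtered: List[float]) -> float:
    """Derive the first position information from the filtered positions."""
    return mean(filtered)


# Per-frame azimuths of the first target audio segment; 95.0 is a spurious reading.
frames = [30.0, 31.0, 29.0, 95.0, 30.0]
kept = filter_positions(frames)
position = first_position(kept)
```

      This sketch ignores angle wrap-around near 0/360 degrees, which a real sound source localization pipeline would need to handle.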
      In some embodiments of the present application, the first determining module 830 is configured to:
      under the condition that a first voiceprint feature with the matching degree meeting a first preset matching condition exists in a preset sound database, determining a sound production object corresponding to the first voiceprint feature as a target sound production object of a second target audio segment;
      under the condition that the matching degree of the first voiceprint features and the target voiceprint features meets a second preset matching condition, the corresponding voiceprint features of the target sound production object in a preset sound database are updated by using the target voiceprint features;
      and the matching degree which needs to be met corresponding to the second preset matching condition is greater than the matching degree which needs to be met corresponding to the first preset matching condition.
      In some embodiments of the present application, in a case that a matching degree between the target location information and the first location information is smaller than a first preset location matching degree threshold and larger than a second preset location matching degree threshold, the first preset matching condition includes that a matching degree of a voiceprint feature in the preset sound database and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold; the second preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a second preset voiceprint matching degree threshold, wherein the second preset voiceprint matching degree threshold is larger than the first preset voiceprint matching degree threshold.
      Under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the first preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a third preset voiceprint matching degree threshold value; the second preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a fourth preset voiceprint matching degree threshold.
      The fourth preset voiceprint matching degree threshold is larger than the third preset voiceprint matching degree threshold, the first preset voiceprint matching degree threshold is smaller than the second preset voiceprint matching degree threshold, and the second preset voiceprint matching degree threshold is smaller than the fourth preset voiceprint matching degree threshold.
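      The threshold selection described above can be sketched as follows. All numeric values are hypothetical and were chosen only to satisfy the orderings stated above (fourth greater than third, first smaller than second, second smaller than fourth). Treating a position matching degree at or above the first position threshold as a match, and assigning a value exactly equal to the second position threshold to the lower band, are assumptions of this sketch; the text leaves those boundary cases open.

```python
from typing import Optional, Tuple

# Hypothetical preset position matching degree thresholds.
POS_T1 = 0.8
POS_T2 = 0.5
# Hypothetical preset voiceprint matching degree thresholds (first to fourth).
VP_T1, VP_T2, VP_T3, VP_T4 = 0.60, 0.70, 0.75, 0.85


def select_thresholds(position_match: float) -> Optional[Tuple[float, float]]:
    """Return (identify threshold, update threshold) for the mismatch case."""
    if position_match >= POS_T1:
        return None  # assumed: positions match, so no voiceprint step is needed
    if position_match > POS_T2:
        return VP_T1, VP_T2  # moderate position mismatch
    return VP_T3, VP_T4      # strong position mismatch: stricter thresholds


def classify(score: float, thresholds: Tuple[float, float]) -> Tuple[bool, bool]:
    """Return (speaker identified, stored voiceprint should be updated)."""
    identify_t, update_t = thresholds
    return score > identify_t, score > update_t


moderate = select_thresholds(0.6)
strong = select_thresholds(0.3)
```

      With these example values, a voiceprint matching degree of 0.72 would both identify the speaker and trigger an update under a moderate position mismatch, but would not even identify the speaker under a strong position mismatch.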
      In some embodiments of the present application, the sound emission target determination device 800 further includes:
      the processing module is used for storing the corresponding relation between the target voiceprint characteristics and the sound production objects corresponding to the target voiceprint characteristics in the preset sound database under the condition that the matching degree of each voiceprint characteristic in the preset sound database and the target voiceprint characteristics meets a third preset matching condition, and determining the sound production objects corresponding to the target voiceprint characteristics as the target sound production objects of the second target audio segment;
      and the third preset matching condition is used for representing that the voiceprint features in the preset sound database are not matched with the target voiceprint features.
      In some embodiments of the present application, in a case that a matching degree between the target location information and the first location information is smaller than a first preset location matching degree threshold and larger than a second preset location matching degree threshold, the third preset matching condition includes that a matching degree of a voiceprint feature in the preset sound database and the target voiceprint feature is smaller than a fifth preset voiceprint matching degree threshold.
      And under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the third preset matching condition comprises that the matching degree between the voiceprint features in the preset sound database and the target voiceprint features is smaller than a sixth preset voiceprint matching degree threshold value.
      And the fifth preset voiceprint matching degree threshold is smaller than the sixth preset voiceprint matching degree threshold.
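      The behaviour of the processing module can be sketched as follows. This is an illustration under assumptions: cosine similarity is used as the matching degree, voiceprint features are plain lists of floats, the preset sound database is a dictionary, and new sound production objects receive generated identifiers such as `speaker_2`. None of these choices is specified by the application, and the name `enroll_if_unmatched` is hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as an assumed matching degree."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def enroll_if_unmatched(
    database: Dict[str, List[float]],
    target: List[float],
    no_match_threshold: float,
) -> Optional[str]:
    """Store `target` as a new speaker if every stored voiceprint is below the threshold."""
    if all(cosine(v, target) < no_match_threshold for v in database.values()):
        new_id = f"speaker_{len(database) + 1}"  # hypothetical identifier scheme
        database[new_id] = target
        return new_id
    return None  # left to the first and second preset matching conditions


db = {"speaker_1": [1.0, 0.0]}
enrolled = enroll_if_unmatched(db, [0.0, 1.0], no_match_threshold=0.3)
not_enrolled = enroll_if_unmatched(db, [1.0, 0.1], no_match_threshold=0.3)
```

      The value passed as `no_match_threshold` would be the fifth or the sixth preset voiceprint matching degree threshold, depending on the position matching degree.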
      Other details of the apparatus 800 for determining a sound generating object according to the embodiment of the present invention are similar to those of the method for determining a sound generating object provided in the first aspect, and are not described herein again.
      Fig. 9 is a schematic structural diagram of a sound emission target determination device according to a seventh aspect. As shown in fig. 9, the sound emission target determination device 900 includes:
      an obtaining module 910, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
      an extracting module 920, configured to determine that the target position information does not match first position information corresponding to a first target audio segment, and extract a target voiceprint feature of the first target audio segment, where the first target audio segment includes all consecutive audio frames emitted by a first sound-emitting object, and the first sound-emitting object is a sound-emitting object of a previous audio frame from which the target audio frame is emitted; the end point of the first target audio segment is a previous audio frame of the target audio frame;
      a first determining module 930 configured to determine a target sound-generating object of the first target audio segment according to the target voiceprint characteristics.
      In an embodiment of the present invention, in a case where it is determined that target position information of a second sound-emitting object from which the target audio frame is emitted does not match first position information corresponding to the first target audio segment, it may be determined that the first sound-emitting object from which the first target audio segment is emitted and the second sound-emitting object from which the target audio frame is emitted are different. Then, the target voiceprint characteristics of the first target audio segment are extracted, and the target sound production object of the first target audio segment is determined according to the target voiceprint characteristics; that is, the identity of the first sound production object is determined so as to realize role separation. By combining the voiceprint recognition technology and the sound source positioning technology, the accuracy of determining the sound production object can be improved.
      Other details of the apparatus 900 for determining a sound generating object according to the embodiment of the present invention are similar to those of the method for determining a sound generating object provided in the second aspect, and are not described herein again.
      Fig. 10 is a schematic structural diagram of a spoken content origin determining apparatus provided in the eighth aspect. As shown in fig. 10, the apparatus 1000 for determining the starting point of utterance content includes:
      an obtaining module 1010, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
      a first determining module 1020, configured to determine that the target position information does not match the first position information corresponding to the first target audio segment, and determine the target audio frame as a starting point of the utterance content of the second utterance object;
      the first target audio segment comprises all continuous audio frames emitted by a first sound-emitting object, and the first sound-emitting object is a sound-emitting object of an audio frame before the target audio frame is emitted; the end point of the first target audio segment is an audio frame preceding the target audio frame.
      In the embodiment of the application, whether the sound-producing object is switched or not can be determined by matching the target position information of the sound-producing object of the target audio frame with the first position information corresponding to all the continuous audio frames of the first sound-producing object, so that the starting point and the end point of the sound-producing content of each sound-producing object can be determined, and the role separation is realized.
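      The start point determination can be illustrated with a minimal sketch. It assumes that position information is an azimuth angle in degrees, that the first position information is the mean azimuth of the current speaker's consecutive frames, and that a mismatch means a difference larger than a fixed tolerance. The function name `find_start_points` and the parameter `tolerance` are hypothetical.

```python
from typing import List


def find_start_points(azimuths: List[float], tolerance: float = 20.0) -> List[int]:
    """Return the frame indices at which a new speaker's content starts."""
    if not azimuths:
        return []
    starts = [0]
    segment = [azimuths[0]]  # consecutive frames of the current speaker
    for index, azimuth in enumerate(azimuths[1:], start=1):
        reference = sum(segment) / len(segment)  # assumed first position information
        if abs(azimuth - reference) > tolerance:
            starts.append(index)   # target audio frame starts new content
            segment = [azimuth]    # previous frame was the end point
        else:
            segment.append(azimuth)
    return starts


# Three frames from one direction, two from another, then back again.
starts = find_start_points([30.0, 31.0, 29.0, 120.0, 121.0, 30.0])
```

      Each start index other than the first also fixes the end point of the preceding speaker's content, namely the frame immediately before it.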
      Other details of the apparatus 1000 for determining the starting point of utterance content according to the embodiment of the present invention are similar to the method for determining a starting point of utterance content provided in the third aspect above, and are not described herein again.
      Fig. 11 is a schematic structural diagram of a sound object identifier changing apparatus provided in the ninth aspect. As shown in fig. 11, the sound generation target identification changing apparatus 1100 includes:
      an obtaining module 1110, configured to obtain a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
      an extraction module 1120, configured to extract a target voiceprint feature of a second target audio segment if it is determined that the target location information does not match first location information corresponding to a first target audio segment, where the first target audio segment includes a first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
      a first determining module 1130, configured to determine a target sound-generating object of the second target audio segment according to the target voiceprint feature;
      and an altering module 1140, configured to alter the identification of the second sound-generating object and the identification of the target sound-generating object, wherein the identifications are used for representing the sound-generating state of the sound-generating objects.
      In an embodiment of the present application, by using the target position information of the sound-generating object of the target audio frame to match with the first position information corresponding to the first target audio segment, it may be determined whether the second sound-generating object is the same as the sound-generating object of the second target audio segment, i.e., whether a switch occurs in the current sound-generating object. And under the condition that the switching of the sound-emitting objects is determined, the identification of the second sound-emitting object and the identification of the target sound-emitting object are changed, so that the identity of the current sound-emitting object can be prompted.
      Other details of the apparatus 1100 for changing an identification of a sound generating object according to the embodiment of the present invention are similar to those of the method for changing an identification of a sound generating object provided in the fourth aspect, and are not described herein again.
      Fig. 12 is a schematic structural diagram of a session record generation apparatus according to the tenth aspect. As shown in fig. 12, the session record generation apparatus 1200 includes:
      an obtaining module 1210, configured to obtain a target audio frame emitted by a second sound-generating object in the audio session data and target position information of the second sound-generating object;
      an extracting module 1220, configured to determine that the target position information does not match the first position information corresponding to the first target audio segment in the audio session data, and extract a target voiceprint feature of a second target audio segment in the audio session data, where the first target audio segment includes the first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
      a first determining module 1230, configured to determine a target sound generation object of the second target audio segment according to the target voiceprint feature;
      and the association module 1240 is configured to associate the target sound-generating object with the text content corresponding to the second target audio segment, so as to obtain a session record of the target sound-generating object.
      In the embodiment of the application, when it is determined that the target position information does not match the first position information corresponding to the first target audio segment in the audio session data, it is determined that the sound generating object of the target audio frame is different from the sound generating object of the first target audio segment. The text content corresponding to the second target audio segment, which has the same sound generating object as the first target audio segment, can therefore be associated with the target sound generating object to form a session record, which makes it more convenient to extract the record of the audio session data subsequently.
      Other details of the session record generating apparatus 1200 according to the embodiment of the present invention are similar to the session record generating method provided in the fifth aspect above, and are not described herein again.
      The methods provided by any of the first, second, third, fourth, and fifth aspects and the apparatus provided by any of the sixth, seventh, eighth, ninth, and tenth aspects described in conjunction with fig. 2-12 may be implemented by a computing device. Fig. 13 is a hardware configuration diagram of a computing device 1300 according to an embodiment of the invention.
      As shown in fig. 13, computing device 1300 includes an input device 1301, an input interface 1302, a processor 1303, a memory 1304, an output interface 1305, and an output device 1306. The input interface 1302, the processor 1303, the memory 1304, and the output interface 1305 are connected to each other via a bus 1310, and the input device 1301 and the output device 1306 are connected to the bus 1310 via the input interface 1302 and the output interface 1305, respectively, and further connected to other components of the computing device 1300.
      Specifically, the input device 1301 receives input information from the outside and transmits the input information to the processor 1303 through the input interface 1302; the processor 1303 processes input information based on computer-executable instructions stored in the memory 1304 to generate output information, stores the output information in the memory 1304 temporarily or permanently, and then transmits the output information to the output device 1306 through the output interface 1305; output device 1306 outputs output information to the exterior of computing device 1300 for use by a user.
      The processor 1303 may include: a Central Processing Unit (CPU), a Neural-network Processing Unit (NPU), a Tensor Processing Unit (TPU), a Field Programmable Gate Array (FPGA) chip, an Artificial Intelligence (AI) chip, and the like. These types are merely exemplary, and the processor is not limited to the types listed herein.
      That is, the computing device shown in fig. 13 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, may implement any embodiment of any of the first to tenth aspects.
      The embodiment of the invention also provides a computer storage medium, wherein the computer storage medium is stored with computer program instructions; the computer program instructions, when executed by a processor, implement the method for determining an utterance object provided by an embodiment of the present invention.
      The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
      It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
      As will be apparent to those skilled in the art, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present invention, and these modifications or substitutions should be covered within the scope of the present invention.