
CN114005451A - Sound production object determination method and device, computing equipment and medium - Google Patents

Sound production object determination method and device, computing equipment and medium

Info

Publication number
CN114005451A
CN114005451A
Authority
CN
China
Prior art keywords
target
position information
preset
sound
target audio
Prior art date
Legal status
Pending
Application number
CN202010736267.2A
Other languages
Chinese (zh)
Inventor
郑斯奇
王宪亮
索宏彬
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010736267.2A
Publication of CN114005451A
Legal status: Pending


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/56Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a method, an apparatus, a computing device, and a medium for determining a sound-emitting object. The method includes: acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object; if it is determined that the target position information does not match the first position information corresponding to a first target audio segment, extracting the target voiceprint feature of a second target audio segment, where the first target audio segment includes the N audio frames preceding the target audio frame, the audio frames in the first target audio segment and in the second target audio segment have the same sound-emitting object, the first target audio segment is at least a part of the second target audio segment, and N is an integer greater than or equal to 1; and determining the target sound-emitting object of the second target audio segment according to the target voiceprint feature. The method can improve the accuracy of determining the sound-emitting object.

Description

Sound production object determination method and device, computing equipment and medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a computing device, and a medium for determining a sound object.
Background
Conversational speech arises in many scenarios, such as daily life, meetings, and telephone conversations. In practical applications, to analyze a speech signal more accurately, it is necessary not only to perform speech recognition but also to separate the speech by role, i.e. to determine the sound-emitting object of each part of the speech. Once the sound-emitting object of the speech is determined, a much wider range of applications opens up. For example, in a large conference room with multiple people, separating the roles of the voices in the conference makes it possible to produce the meeting record quickly and to record what each speaker in the room said.
At present, the sound-emitting object is mostly determined by voiceprint recognition alone, but the accuracy is low. There is therefore an urgent need for a more accurate method of determining the sound-emitting object.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining a sound generating object, a computing device and a medium, which can solve the problem of low accuracy of determining the sound generating object.
According to a first aspect of embodiments of the present invention, there is provided a sound emission object determination method, including:
acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
if the target position information is determined not to be matched with first position information corresponding to a first target audio segment, extracting target voiceprint characteristics of a second target audio segment, wherein the first target audio segment comprises the first N audio frames of the target audio frames; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment; n is an integer greater than or equal to 1;
and determining a target sound production object of the second target audio segment according to the target voiceprint characteristics.
According to a second aspect of the embodiments of the present invention, there is provided a sound emission target determination method including:
acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
if the target position information is determined not to be matched with first position information corresponding to a first target audio segment, extracting target voiceprint characteristics of the first target audio segment, wherein the first target audio segment comprises all continuous audio frames emitted by a first sound-emitting object, and the first sound-emitting object is a sound-emitting object of a previous audio frame from which the target audio frame is emitted; the end point of the first target audio segment is an audio frame preceding the target audio frame;
and determining a sound production object of the first target audio segment according to the target voiceprint characteristics.
According to a third aspect of the embodiments of the present invention, there is provided a method for determining a starting point of utterance content, including:
acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
determining that the target position information is not matched with first position information corresponding to a first target audio segment, and determining the target audio frame as a starting point of the sound production content of the second sound production object;
wherein the first target audio segment comprises all consecutive audio frames emitted by a first sound-emitting object, the first sound-emitting object being a sound-emitting object of an audio frame preceding the target audio frame; the end point of the first target audio segment is an audio frame preceding the target audio frame.
According to a fourth aspect of the embodiments of the present invention, there is provided a sound emission object identifier changing method, including:
acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
if the target position information is determined not to be matched with first position information corresponding to a first target audio segment, extracting target voiceprint characteristics of a second target audio segment, wherein the first target audio segment comprises the first N audio frames of the target audio frames; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
determining a target sound production object of the second target audio segment according to the target voiceprint characteristics;
and changing the identification of the second sound-emitting object and the identification of the target sound-emitting object, wherein the identification is used for representing the sound-emitting state of the sound-emitting object.
According to a fifth aspect of the embodiments of the present invention, there is provided a session record generating method, including:
acquiring a target audio frame sent by a second sound-producing object in audio session data and target position information of the second sound-producing object;
if the target position information is determined not to be matched with first position information corresponding to a first target audio segment in the audio session data, extracting target voiceprint characteristics of a second target audio segment in the audio session data, wherein the first target audio segment comprises the first N audio frames of the target audio frames; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
determining a target sound production object of the second target audio segment according to the target voiceprint characteristics;
and associating the target sound-emitting object with the text content corresponding to the second target audio segment to obtain a conversation record of the target sound-emitting object.
According to a sixth aspect of the embodiments of the present invention, there is provided a sound emission target determination device including:
the acquisition module is used for acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
an extraction module, configured to extract a target voiceprint feature of a second target audio segment if it is determined that the target position information does not match first position information corresponding to a first target audio segment, where the first target audio segment includes first N audio frames of the target audio frames; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment; n is an integer greater than or equal to 1;
and the first determining module is used for determining a target sound production object of the second target audio segment according to the target voiceprint characteristics.
According to a seventh aspect of the embodiments of the present invention, there is provided a sound emission target determination device including:
the acquisition module is used for acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
an extraction module, configured to extract a target voiceprint feature of a first target audio segment if it is determined that the target position information does not match first position information corresponding to the first target audio segment, where the first target audio segment includes all consecutive audio frames emitted by a first sound-emitting object, and the first sound-emitting object is a sound-emitting object that emits a previous audio frame of the target audio frame; the end point of the first target audio segment is an audio frame preceding the target audio frame;
and the first determining module is used for determining a sound production object of the first target audio segment according to the target voiceprint characteristics.
According to an eighth aspect of the embodiments of the present invention, there is provided a spoken content origin determining apparatus including:
the acquisition module is used for acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
a first determining module, configured to determine that the target position information does not match first position information corresponding to a first target audio segment, and determine the target audio frame as a starting point of the sound production content of the second sound production object;
wherein the first target audio segment comprises all consecutive audio frames emitted by a first sound-emitting object, the first sound-emitting object being a sound-emitting object of an audio frame preceding the target audio frame; the end point of the first target audio segment is an audio frame preceding the target audio frame.
According to a ninth aspect of the embodiments of the present invention, there is provided a sound emission target mark changing device including:
the acquisition module is used for acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
an extraction module, configured to extract a target voiceprint feature of a second target audio segment if it is determined that the target position information does not match first position information corresponding to a first target audio segment, where the first target audio segment includes a first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
a first determining module, configured to determine a target sound generation object of the second target audio segment according to the target voiceprint feature;
and the changing module is used for changing the identifier of the second sound-emitting object and the identifier of the target sound-emitting object, wherein the identifiers are used for representing the sound-emitting state of the sound-emitting object.
According to a tenth aspect of the embodiments of the present invention, there is provided a session record generating apparatus including:
the acquisition module is used for acquiring a target audio frame emitted by a second sound-emitting object in audio session data and target position information of the second sound-emitting object;
an extraction module, configured to extract a target voiceprint feature of a second target audio segment in the audio session data if it is determined that the target position information does not match first position information corresponding to a first target audio segment in the audio session data, where the first target audio segment includes a first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
a first determining module, configured to determine a target sound generation object of the second target audio segment according to the target voiceprint feature;
and the association module is used for associating the target sound-emitting object with the character content corresponding to the second target audio segment to obtain the conversation record of the target sound-emitting object.
According to an eleventh aspect of embodiments of the present invention, there is provided a computing device comprising: a processor and a memory storing computer program instructions;
the processor, when executing the computer program instructions, implements the methods as provided in the first, second, third, fourth or fifth aspects above.
According to a twelfth aspect of embodiments of the present invention, there is provided a computer storage medium having computer program instructions stored thereon, which when executed by a processor, implement the method as provided in the first, second, third, fourth or fifth aspect described above.
According to the embodiment of the present invention, when it is determined that the target position information of the second sound-emitting object that emitted the target audio frame does not match the first position information corresponding to the first target audio segment, it can be determined that the first sound-emitting object that emitted the first target audio segment differs from the second sound-emitting object that emitted the target audio frame. The target voiceprint feature of the second target audio segment emitted by the first sound-emitting object is then extracted, and the target sound-emitting object of the second target audio segment is determined from it, i.e. the identity of the first sound-emitting object is determined, thereby achieving role separation. By combining voiceprint recognition with sound source localization, the accuracy of determining the sound-emitting object is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below; those skilled in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic application scenario diagram of a method for determining a sound object according to a first aspect of the present application;
fig. 2 is a schematic flowchart of an embodiment of a method for determining an utterance object according to a first aspect of the present application;
FIG. 3 is a schematic flow chart of voiceprint matching provided by the present application;
fig. 4 is a schematic flow chart of a method for determining a sound object provided in the second aspect of the present application;
fig. 5 is a flowchart illustrating an embodiment of a method for determining a starting point of utterance content according to a third aspect of the present application;
fig. 6 is a schematic flowchart of an embodiment of a method for changing an identification of a sound object according to a fourth aspect of the present application;
fig. 7 is a flowchart illustrating an embodiment of a session record generation method according to a fifth aspect of the present application;
fig. 8 is a schematic structural diagram of an embodiment of a sound-generating object determining apparatus according to a sixth aspect of the present application;
fig. 9 is a schematic structural diagram of an embodiment of a sound emission target determination apparatus according to a seventh aspect of the present application;
fig. 10 is a schematic structural diagram of an embodiment of a spoken content origin determining apparatus according to an eighth aspect of the present application;
fig. 11 is a schematic structural diagram of an embodiment of a sound object identifier changing apparatus according to a ninth aspect of the present application;
fig. 12 is a schematic structural diagram of an embodiment of a session record generation apparatus according to a tenth aspect of the present application;
fig. 13 is a schematic diagram of an embodiment of a hardware structure of a computing device provided in the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Fig. 1 is a schematic view of an application scenario of the method for determining a sound object provided in the present application. For example, fig. 1 shows a conference room scenario. A conference room includes a plurality of people participating in a conference. Only 3 participants, namely participant a, participant B and participant C, are schematically shown in fig. 1, and the number of participants is not limited. In order to facilitate the subsequent rapid extraction of the voice content in the conference process, the voices of a plurality of participants in the whole conference process can be separated in roles.
While a participant emits an audio signal, the signal can be collected by an audio collector preset in the conference room. After the audio collector obtains the audio signals emitted by the participants, the collected signals can be divided into frames of a preset duration, yielding the audio frames emitted by the participants.
In the embodiment of the application, the audio frame currently acquired by the audio acquisition device is used as the target audio frame.
It should be noted that, in the embodiment of the present application, in order to achieve accurate determination of the sound emission object, it is also necessary to acquire position information of the sound emission object from which the target audio frame is emitted. For each target audio frame, positional information of a sound-emitting object from which the target audio frame is emitted may be determined by a sound source localization technique. For example, the position information of the sound-generating object that generates the target audio frame may be acquired by using an audio collector installed in the conference room in advance.
The sound source localization technique is a technique of acquiring position information of a sound source. In some embodiments, the sound source position information may be relative position information between the sound-emitting object and a preset audio collector. For example, the sound source position information may include an angle between the sound-emitting object and a preset audio collector.
In some embodiments, the preset audio collector may be a microphone array, and the sound source localization technology may be a microphone array sound source localization. The microphone array is composed of several to thousands of microphones which are arranged according to a certain rule. After the audio signals are received by the microphone array, a time delay estimation method is adopted to position the sound source. Specifically, the audio signals are received through the microphone array, the time delay of the audio signals received by each microphone relative to the audio signals received by the reference point is calculated, and the sound source is positioned according to the calculated time delay.
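The patent does not prescribe a delay-estimation algorithm; as a hedged illustration, the classical GCC-PHAT estimate for a pair of microphones, mapped to a far-field direction-of-arrival angle, might look like the following Python sketch (the function names and the far-field angle model are assumptions, not taken from the patent):

```python
import numpy as np

def gcc_phat_delay(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Estimate the delay (in seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    shift = int(np.argmax(np.abs(cc)))
    if shift > n // 2:                    # indices past n/2 are negative lags
        shift -= n
    return shift / fs

def doa_angle_deg(delay_s: float, mic_dist_m: float, c: float = 343.0) -> float:
    """Map an inter-microphone delay to a direction-of-arrival angle (degrees)."""
    cos_theta = np.clip(delay_s * c / mic_dist_m, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```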
For example, participant B speaks during the period 0-t1 of the conference, participant A speaks during t1-t2, and participant C speaks during t2-t3, where t1 < t2 < t3. The audio collector in the conference room collects the audio frames emitted by the participants in real time.
In the embodiment of the application, when the audio collector in the conference room acquires the 1st audio frame and the position information D1 of the sound-emitting object that emitted it, the 1st audio frame is taken as the starting point of the first sound-emitting object's utterance content.
Then, the audio collector acquires the 2nd audio frame and the position information D2 of the sound-emitting object that emitted it. When the target audio frame is the 2nd audio frame, the first target audio segment includes the 1st audio frame, and the position information D2 of the 2nd audio frame is matched against the position information D1 of the 1st audio frame. If D2 matches D1, it can be determined that the sound-emitting object of both the 1st and the 2nd audio frame is the first sound-emitting object, and the 3rd audio frame and the position information D3 of the sound-emitting object that emitted it are acquired next.
When the 3rd audio frame is the target audio frame, the first target audio segment may include the 1st and 2nd audio frames. Since the 1st and 2nd audio frames have the same sound-emitting object, the second target audio segment may also include the 1st and 2nd audio frames.
Then, it is determined whether the position information D3 of the 3rd audio frame matches the first position information corresponding to the first target audio segment. For example, the first position information corresponding to the first target audio segment may be the average value D' of the position information of the sound-emitting objects of the 1st and 2nd audio frames. If D3 matches D', it is determined that the sound-emitting object of the 1st to 3rd audio frames is the first sound-emitting object.
By analogy, it can be determined according to the above method which audio frames belong to the first sound-emitting object. Assuming the sound-emitting objects of the 1st to M1-th audio frames are all the first sound-emitting object, the (M1+1)-th audio frame, i.e. the target audio frame, and the target position information D(M1+1) of the sound-emitting object that emitted it are acquired next.
When the target audio frame is the (M1+1)-th audio frame, the first target audio segment may include the N audio frames preceding the target audio frame, where N is an integer greater than or equal to 1; that is, the first target audio segment may include the (M1-N)-th to the M1-th audio frames, where M1 is a positive integer. It should be noted that if fewer than N audio frames were acquired before the target audio frame, the first target audio segment includes all audio frames before the target audio frame.
When the target audio frame is the (M1+1)-th audio frame, the second target audio segment may be all or part of the audio frames emitted by the sound-emitting object of the first target audio segment, and the second target audio segment includes the first target audio segment. For example, the second target audio segment includes all audio frames from the 1st to the M1-th audio frame.
As an example, let M1 = 1000 and N = 500. If the 1001st audio frame is the target audio frame, the first target audio segment includes the 500th to the 1000th audio frames, and the second target audio segment includes the 1st to the 1000th audio frames.
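The same indexing, written out as a short Python sketch (1-based frame numbers as in the example; the variable names are illustrative only):

```python
M1, N = 1000, 500                        # values from the example above
target_idx = M1 + 1                      # the 1001st frame is the target audio frame
first_segment = range(M1 - N, M1 + 1)    # frames 500..1000: first target audio segment
second_segment = range(1, M1 + 1)        # frames 1..1000: second target audio segment
```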
For example, the first position information corresponding to the first target audio segment may be the average value D'' of the position information of the sound-emitting object over the 500th to 1000th audio frames. Next, the position information D1001 of the participant who emitted the 1001st audio frame is matched against D''. If D1001 does not match D'', it can be determined that the sound-emitting object of the 1001st audio frame differs from the first sound-emitting object that emitted the 1st to 1000th audio frames, i.e. the speaker in the conference room has switched. The 1000th audio frame can then be taken as the end point of the first sound-emitting object's utterance content, and the 1001st audio frame as the starting point of the second sound-emitting object's utterance content.
Next, the audio frames from the starting point to the end point of the first sound-emitting object's utterance content, i.e. the 1st to 1000th audio frames, can be extracted as the second target audio segment. The target voiceprint feature of the second target audio segment is then extracted, and based on it the target sound-emitting object of the 1st to 1000th audio frames, i.e. the first sound-emitting object, is determined. Because participant B spoke first, the sound-emitting object of the 1st to 1000th audio frames can be determined to be participant B from the target voiceprint feature of those frames.
Here, voiceprint recognition is used to identify the sound-emitting object from its voiceprint features. Voiceprint recognition is a technique for recognizing the identity of a sound-emitting object through its voiceprint features; a voiceprint is the spectrum of sound waves carrying speech information.
Then, the next target audio frame is acquired, and by the same method the end point of the second sound-emitting object's utterance content can be obtained. For example, if the sound-emitting objects of the 1001st to 2000th audio frames are all the second sound-emitting object, and the target position information of the sound-emitting object of the 2001st audio frame is determined not to match the average position information over the 1500th to 2000th audio frames, then the 2000th audio frame is the end point of the second sound-emitting object's utterance content. By extracting the target voiceprint feature of the 1001st to 2000th audio frames, their target sound-emitting object can be determined from that feature. Since participant A is the second speaker, the sound-emitting object of the 1001st to 2000th audio frames can be determined to be participant A. The audio frames emitted by participant C can be determined in the same way.
In the embodiment of the application, by combining the voiceprint recognition technology and the sound source positioning technology, the sound production object corresponding to the sound production content can be accurately determined.
It should be noted that, besides the conference room scene, the method for determining a sound-emitting object provided in the embodiments of this specification can also be applied to other scenes, such as audition, interview, and classroom scenes; the conference room scene is used here only as an example.
Based on the above application scenarios, the following describes in detail a method for determining a sound object according to an embodiment of the present application with reference to fig. 2.
Fig. 2 is a flow chart of a method for determining an utterance object according to a first aspect of the present application.
As shown in fig. 2, a method 200 for determining an utterance object provided in an embodiment of the present application includes:
step 210, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
step 220, if the target position information is determined not to match the first position information corresponding to the first target audio segment, extracting target voiceprint characteristics of a second target audio segment, wherein the first target audio segment comprises the first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment; n is an integer greater than or equal to 1;
step 230, determining a target sound-emitting object of the second target audio segment according to the target voiceprint characteristics.
A specific implementation of step 210 will be described first.
In an embodiment of the present application, the voice data stream comprises a series of time-ordered sample point values, obtained by sampling the original analog sound signal at a particular audio sampling rate; such a series of sample point values describes the sound. The audio sampling rate is the number of samples taken per second, measured in hertz (Hz); the higher the sampling rate, the higher the sound-wave frequencies that can be described. An audio frame comprises a fixed number of time-ordered sample point values.
After the sound-emitting object emits an audio signal, the preset audio collector collects it. Once the audio signal emitted by the sound-emitting object is acquired from the collector, the target audio signal can be divided into frames of a preset duration, yielding the individual audio frames emitted by the sound-emitting object. In the embodiment of the application, the audio frame currently acquired by the audio collector is determined as the target audio frame.
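The patent does not fix a frame duration; a minimal framing sketch in Python, assuming a 1-D stream of sample point values and an illustrative 25 ms frame length, might look like this:

```python
import numpy as np

def frame_audio(samples: np.ndarray, fs: int, frame_ms: float = 25.0) -> np.ndarray:
    """Split a 1-D stream of sample point values into fixed-duration frames."""
    frame_len = int(fs * frame_ms / 1000)   # sample points per frame
    n_frames = len(samples) // frame_len    # drop any trailing partial frame
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

# e.g. at a 16 kHz sampling rate each 25 ms frame holds 400 sample points:
# frames = frame_audio(signal, fs=16000)
```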
In the embodiment of the present application, the position information of the sound emission object that emits the target audio frame can be acquired by using the sound source localization technology. For the description of the sound source localization technique, reference is made to the above description, and the description is not repeated here.
As an example, the audio collector is a microphone array; when the sound-emitting object emits an audio signal, the signal is collected by the microphones in the array. To process the audio signal in real time, the signal acquired by the microphone array may be framed according to a preset duration, and the most recently obtained audio frame is used as the target audio frame. Because each microphone in the array is generally at a different distance from the sound-emitting object, the microphones receive the target audio frame at different times, and the target position information of the sound-emitting object that emitted the target audio frame can be calculated from the time differences with which the microphones receive it.
The specific implementation of step 220 is described below.
In the embodiment of the present application, when the sound-emitting object switches from object A to object B, the position information of the two objects differs because their positions differ. To accurately determine whether the sound-emitting object has switched, it is necessary to determine whether the target position information of the second sound-emitting object that emitted the target audio frame matches the position information of the first sound-emitting object that emitted the audio frame preceding the target audio frame.
Since the first sound-emitting object may have emitted multiple audio frames before the target audio frame, to improve the accuracy of detecting a switch, the first position information corresponding to a first target audio segment comprising the N audio frames preceding the target audio frame can be matched against the target position information.
The first target audio segment comprises the N audio frames preceding the target audio frame, and every audio frame in the first target audio segment has the same sound-emitting object. That is, the sound-emitting object of the first target audio segment is the first sound-emitting object, which emitted the audio frame preceding the target audio frame; in other words, each audio frame in the first target audio segment corresponds to the same sound-emitting object, namely the first sound-emitting object.
It should be noted that, if the number of audio frames emitted by the first sound-emitting object acquired before the target audio frame is less than N, all consecutive audio frames emitted by the first sound-emitting object are taken as the first target audio segment, and the end point of the first target audio segment is the frame before the target audio frame.
In some embodiments, the first position information corresponding to the first target audio segment is determined based on position information of a sound-generating object corresponding to an audio frame in the first target audio segment. For example, the first position information is determined based on an average of position information of the sound-generating object corresponding to each audio frame in the first target audio segment.
For example, the position information of the sound object is an angle between the sound object and a preset microphone array. Then, the first position information is a first included angle corresponding to the first target audio segment, and the first included angle is an average value of included angles between a sound production object corresponding to each audio frame in the first target audio segment and the preset microphone array.
In the embodiment of the present application, whether the target location information matches the first location information may be determined by whether a difference between the target location information and the first location information is within a preset value range. If the difference value is within the preset value range, the target position information is matched with the first position information, and if the difference value exceeds the preset value range, the target position information is not matched with the first position information.
The matching degree between the target position information and the first position information can be represented by the difference value between the target position information and the first position information. That is, the matching degree between the target position information and the first position information is the difference between the target position information and the first position information.
In other embodiments of the present application, the degree of match between the target location information and the first location information may be characterized by a ratio of a difference between the target location information and the first location information to the first location information. And if the ratio of the difference value of the target position information and the first position information to the first position information is within a preset ratio range, representing that the target position information is matched with the first position information. And if the ratio of the difference value of the target position information and the first position information to the first position information is not in the preset ratio range, indicating that the target position information is not matched with the first position information.
The specific implementation manner of determining whether the target location information matches the first location information is not limited.
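As one concrete possibility (the patent leaves the criterion open), a hedged Python sketch of both criteria, computing the first position information as the average angle and treating the numeric thresholds as illustrative assumptions:

```python
import numpy as np

def first_position_info(segment_angles) -> float:
    """First position information: the average angle over the audio frames
    of the first target audio segment (one angle per frame)."""
    return float(np.mean(segment_angles))

def positions_match(target_angle, first_angle, max_diff_deg=10.0, max_ratio=None):
    """Either matching criterion described above: absolute difference within
    a preset value range, or the ratio of that difference to the first
    position information within a preset ratio range."""
    diff = abs(target_angle - first_angle)
    if max_ratio is not None:               # ratio-based criterion
        return diff / abs(first_angle) <= max_ratio
    return diff <= max_diff_deg             # difference-based criterion
```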
In the embodiment of the present application, when the target position information does not match the first position information, this indicates that the second sound-emitting object that emitted the target audio frame differs from the sound-emitting object of the first target audio segment, i.e. the sound-emitting object may have switched, and the sound-emitting object of the first target audio segment can then be further determined using voiceprint recognition.
Since the first target audio segment may only be a part of the audio signal emitted by the first sound-producing object (i.e. the sound-producing object of the audio frame preceding the target audio frame), when performing the voice stream role separation, the role separation of the second target audio segment emitted by the first sound-producing object is required.
As one example, the second target audio segment includes the first target audio segment, and the voicing object of the audio frame in the second target audio segment is the same as the voicing object of the audio frame in the first target audio segment. That is, the sound generation object of each audio frame in the second target audio segment is the same as the sound generation object of each audio frame in the first target audio segment, i.e., the first sound generation object.
In some embodiments, whether the sound-emitting object of the target audio frame is the same as that of the first target audio segment can be judged by determining whether the target position information matches the first position information. If they do not match, the target audio frame can be regarded as the starting point of the second sound-emitting object's utterance content, i.e. the first audio frame it emits, and the audio frame preceding the target audio frame can be regarded as the end point of the first sound-emitting object's utterance content, i.e. the last audio frame it emits. By the same reasoning, the starting point of the first sound-emitting object's utterance content, i.e. the first audio frame emitted by the first sound-emitting object, will already have been obtained earlier.
If the target position information does not match the first position information, at least a portion of consecutive audio frames between a first audio frame emitted by the first originating object and an audio frame preceding the target audio frame may be determined as the second target audio segment.
For example, all audio frames between the first audio frame emitted by the sound-emitting object of the first target audio segment and the audio frame preceding the target audio frame may be determined as the second target audio segment. That is, the starting point and end point of the second target audio segment are the two successive points at which the position information failed to match. If the position information of the sound-emitting object is the angle between the sound-emitting object and the microphone array, the starting point and end point of the second target audio segment are the two successive points at which that angle changed.
It should be noted that the first target audio segment and the second target audio segment each include a plurality of consecutive audio frames.
When a change in the position information is detected, the audio segment between the two successive change points is determined as the second target audio segment, and the target voiceprint feature of the second target audio segment is then extracted.
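Putting the per-frame logic together, a minimal streaming sketch in Python; `locate_angle`, `extract_voiceprint`, and `identify_speaker` are placeholder callables for the sound source localization and voiceprint steps, and for brevity the sketch averages over the whole current segment rather than only its last N frames:

```python
def role_separation(frames, locate_angle, extract_voiceprint, identify_speaker,
                    positions_match):
    """Grow the current segment while the position matches; when a change
    point is detected, identify the finished segment by voiceprint."""
    segment, angles, results = [], [], []
    for frame in frames:                          # each incoming target audio frame
        angle = locate_angle(frame)               # target position information
        if segment:
            avg = sum(angles) / len(angles)       # first position information
            if not positions_match(angle, avg):   # speaker switch detected
                vp = extract_voiceprint(segment)  # second target audio segment
                results.append((identify_speaker(vp), len(segment)))
                segment, angles = [], []          # target frame starts a new segment
        segment.append(frame)
        angles.append(angle)
    if segment:                                   # flush the final segment
        vp = extract_voiceprint(segment)
        results.append((identify_speaker(vp), len(segment)))
    return results
```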
In this embodiment of the application, if the target position information matches the first position information, the second sound-emitting object is the sound-emitting object of the first target audio segment, i.e. the first target audio segment and the target audio frame correspond to the same sound-emitting object. In that case the next audio frame is acquired as the new target audio frame, together with the target position information of the second sound-emitting object that emitted it; that is, the process returns to step 210.
It should be noted that after a new target audio frame is acquired, the first target audio segment in step 220 is updated, i.e. it now includes the previous target audio frame, and the first position information corresponding to the first target audio segment is updated accordingly.
The specific implementation of step 230 is described below.
In the embodiment of the present application, a sound database is pre-established, where the sound database includes a correspondence between a sound-generating object and a voiceprint feature, and a correspondence between the sound-generating object and an audio signal.
After the target voiceprint feature of the second target audio segment is obtained, in order to determine the sound-emitting object of the second target audio segment, the target voiceprint feature is matched against each voiceprint feature in the preset sound database, and the sound-emitting object corresponding to the voiceprint feature that matches the target voiceprint feature is determined as the target sound-emitting object of the second target audio segment. Moreover, the second target audio segment can be recorded as an audio signal corresponding to the target sound-emitting object, i.e. added to the database as such.
In some embodiments of the present application, step 230 comprises: under the condition that a first voiceprint feature with the matching degree meeting a first preset matching condition exists in a preset sound database, determining a sound production object corresponding to the first voiceprint feature as a target sound production object corresponding to a second target audio segment; and under the condition that the matching degree of the first voiceprint features and the target voiceprint features meets a second preset matching condition, updating the corresponding voiceprint features of the target sound production object in a preset sound database by using the target voiceprint features.
The matching degree required by the second preset matching condition is greater than that required by the first preset matching condition.
In the embodiment of the application, only when the matching degree of the first voiceprint feature and the target voiceprint feature meets the second preset matching condition, the voiceprint feature corresponding to the target sound object in the preset sound database is updated by using the target voiceprint feature, so that the richness and the accuracy of the voiceprint feature corresponding to the target sound object can be improved, and the accuracy of voiceprint recognition is improved.
As an example, the first preset matching condition is that the degree of matching between the voiceprint features in the preset sound database and the target voiceprint features is greater than 80%, and the second preset matching condition is that the degree of matching between the voiceprint features in the preset sound database and the target voiceprint features is greater than 90%.
In the embodiment of the application, when the matching degree between the first voiceprint feature and the target voiceprint feature does not satisfy the second preset matching condition, the voiceprint feature of the target sound-emitting object in the preset sound database is not updated with the target voiceprint feature; the second target audio segment is only determined as an audio signal emitted by the target sound-emitting object.
In an embodiment of the present application, if there are a plurality of first voiceprint features, the sound generation object corresponding to the first voiceprint feature with the highest matching degree with the target voiceprint feature is determined as the sound generation object corresponding to the second target audio segment.
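A hedged sketch of this matching-and-updating logic in Python, using cosine similarity as the matching degree and thresholds mirroring the 80%/90% example above (both the similarity measure and the numbers are assumptions, not mandated by the patent):

```python
import numpy as np

def matching_degree(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_voiceprint(target_vp, sound_db, first_cond=0.80, second_cond=0.90):
    """Pick the best-matching voiceprint in the preset sound database: assign
    the target sound-emitting object if the first preset matching condition
    holds, and refresh the stored voiceprint if the second also holds."""
    best_id, best_degree = None, -1.0
    for obj_id, stored_vp in sound_db.items():
        degree = matching_degree(target_vp, stored_vp)
        if degree > best_degree:
            best_id, best_degree = obj_id, degree
    if best_degree > first_cond:              # first preset matching condition
        if best_degree > second_cond:         # second preset matching condition
            sound_db[best_id] = target_vp     # update the stored voiceprint
        return best_id, best_degree
    return None, best_degree                  # no stored voiceprint matched
```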
In some embodiments of the present application, when the matching degree between the target position information and the first position information is smaller than a first preset position matching degree threshold and larger than a second preset position matching degree threshold, the first preset matching condition includes that the matching degree between a voiceprint feature in the preset sound database and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold, and the second preset matching condition includes that it is larger than a second preset voiceprint matching degree threshold, where the second preset voiceprint matching degree threshold is larger than the first.
Under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the first preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a third preset voiceprint matching degree threshold value; the second preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a fourth preset voiceprint matching degree threshold.
The fourth preset voiceprint matching degree threshold is larger than the third preset voiceprint matching degree threshold, the first preset voiceprint matching degree threshold is smaller than the second preset voiceprint matching degree threshold, the second preset voiceprint matching degree threshold is smaller than the fourth preset voiceprint matching degree threshold, and the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold.
In the embodiment of the application, a matching degree between the target position information and the first position information that is smaller than the first preset position matching degree threshold but larger than the second represents that the two do not match, yet are relatively close. A matching degree smaller than the second preset position matching degree threshold represents that the two do not match and differ considerably.
When the target position information is closer to the first position information, then the utterance object representing the second utterance object and the first target audio segment may be the same, i.e., the utterance object of the target audio frame and the utterance object of the second target audio segment may be the same person, and thus the threshold for voiceprint matching may be set slightly lower. When the target position information is significantly different from the first position information, then the voicing object representing the second voicing object and the first target audio segment may be different, i.e., the voicing object for the target audio frame and the voicing object for the second target audio segment may not be the same person, so the threshold for voiceprint matching may be set slightly higher. Namely, the first preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold, and the second preset voiceprint matching degree threshold is smaller than the fourth preset voiceprint matching degree threshold.
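A short sketch of this threshold selection, assuming the matching degrees are similarity scores and using our own names for the thresholds:

```python
def select_voiceprint_thresholds(pos_match, pos_thresh_first, pos_thresh_second, t):
    """Choose the (match, update) voiceprint threshold pair based on how
    close the mismatched target position is to the first position.

    `t` is assumed to hold the four thresholds named in the text, with
    t["first"] < t["second"], t["third"] < t["fourth"],
    t["first"] < t["third"], and t["second"] < t["fourth"].
    """
    if pos_thresh_second < pos_match < pos_thresh_first:
        # Mismatched but close: the speaker may be unchanged, so the
        # slightly lower first/second thresholds apply.
        return t["first"], t["second"]
    # Clearly different positions: demand stronger voiceprint evidence
    # via the slightly higher third/fourth thresholds.
    return t["third"], t["fourth"]
```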
In some embodiments of the present application, the sound-emitting object determination method provided by an embodiment of the present application further includes: in a case that the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature satisfies a third preset matching condition, storing, in the preset sound database, the correspondence between the target voiceprint feature and the sound-emitting object corresponding to the target voiceprint feature, and determining the sound-emitting object corresponding to the target voiceprint feature as the target sound-emitting object of the second target audio segment; the third preset matching condition is used for representing that no voiceprint feature in the preset sound database matches the target voiceprint feature.
In an embodiment of the present application, when each voiceprint feature in the preset sound database does not match the target voiceprint feature, it indicates that the sound generation object corresponding to each voiceprint feature in the preset sound database is not the sound generation object corresponding to the second target audio segment.
In some embodiments, the sound-emitting object corresponding to the target voiceprint feature may be determined from a pre-established correspondence between voiceprint features and sound-emitting objects. The correspondence between the target voiceprint feature and its sound-emitting object is then written to the preset sound database; that is, the sound-emitting object corresponding to the target voiceprint feature is registered in the preset sound database.
In other embodiments of the present application, the sound-emitting object determination method provided by an embodiment of the present application further includes: discarding the second target audio segment in a case that the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature satisfies a fourth preset matching condition but does not satisfy the third preset matching condition.

The fourth preset matching condition is also used for representing that the voiceprint features in the preset sound database do not match the target voiceprint feature, but the degree of mismatch required by the fourth preset matching condition is lower than that required by the third preset matching condition.

That is, when the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature satisfies the fourth preset matching condition but does not satisfy the third preset matching condition, the target voiceprint feature of the second target audio segment and its sound-emitting object are not registered, so as to improve the accuracy of subsequent sound-emitting object determination.
In some embodiments of the present application, in a case that a matching degree between the target location information and the first location information is smaller than a first preset location matching degree threshold and larger than a second preset location matching degree threshold, the third preset matching condition includes that a matching degree of a voiceprint feature in the preset sound database and the target voiceprint feature is smaller than a fifth preset voiceprint matching degree threshold.
And under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the third preset matching condition comprises that the matching degree between the voiceprint features in the preset sound database and the target voiceprint features is smaller than a sixth preset voiceprint matching degree threshold value.
And the fifth preset voiceprint matching degree threshold is smaller than the sixth preset voiceprint matching degree threshold.
In some embodiments of the present application, in a case that the matching degree between the target location information and the first location information is smaller than a first preset location matching degree threshold and larger than a second preset location matching degree threshold, the fourth preset matching condition is that the matching degree between the voiceprint feature in the preset sound database and the target voiceprint feature is smaller than a seventh preset voiceprint matching degree threshold.
And under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the fourth preset matching condition is that the matching degree between the voiceprint features in the preset sound database and the target voiceprint features is smaller than an eighth preset voiceprint matching degree threshold value.
And the seventh preset voiceprint matching degree threshold is smaller than the eighth preset voiceprint matching degree threshold.
When the target position information is close to the first position information, the second sound-emitting object and the sound-emitting object of the first target audio segment may be the same, i.e., the sound-emitting object of the target audio frame and the sound-emitting object of the second target audio segment may be the same person, so the mismatch thresholds for voiceprint matching may be set slightly lower to improve the accuracy of sound-emitting object determination. When the target position information differs considerably from the first position information, the sound-emitting objects are likely different, i.e., the sound-emitting object of the target audio frame and the sound-emitting object of the second target audio segment may not be the same person, so the mismatch thresholds may be set slightly higher. That is, the fifth preset voiceprint matching degree threshold is smaller than the sixth preset voiceprint matching degree threshold, and the seventh preset voiceprint matching degree threshold is smaller than the eighth preset voiceprint matching degree threshold. With such an arrangement, the accuracy of sound-emitting object determination can be improved.

That is, in the case where the target position information does not match the first position information, two degrees of mismatch can be distinguished: a matching degree between the target position information and the first position information that is smaller than the first preset position matching degree threshold but larger than the second preset position matching degree threshold, and a matching degree that is smaller than the second preset position matching degree threshold. The former represents a slightly lower degree of mismatch between the target position information and the first position information; the latter represents a slightly higher degree of mismatch.

For each of these two cases, voiceprint matching uses four preset voiceprint matching degree thresholds. When the degree of mismatch is slightly lower, the target position information is close to the first position information, and the four corresponding thresholds are the first, second, fifth, and seventh preset voiceprint matching degree thresholds. When the degree of mismatch is slightly higher, the target position information differs considerably from the first position information, and the four corresponding thresholds are the third, fourth, sixth, and eighth preset voiceprint matching degree thresholds. Each threshold used when the degree of mismatch is slightly lower is smaller than its counterpart used when the degree of mismatch is slightly higher. That is, the first preset voiceprint matching degree threshold is smaller than the third, the second is smaller than the fourth, the fifth is smaller than the sixth, and the seventh is smaller than the eighth.
It should be noted that the seventh preset voiceprint matching degree threshold is smaller than the first preset voiceprint matching degree threshold, and the eighth preset voiceprint matching degree threshold is smaller than the third preset voiceprint matching degree threshold.
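Taken together, the orderings can be checked with illustrative numbers (the values below are ours; the application fixes only the inequalities):

```python
# Illustrative values only; the application constrains the ordering, not the numbers.
t = {
    "first": 0.60, "second": 0.75,    # match / update thresholds, positions close
    "third": 0.70, "fourth": 0.85,    # match / update thresholds, positions far
    "fifth": 0.30, "sixth": 0.40,     # register-as-new bounds, close / far
    "seventh": 0.45, "eighth": 0.55,  # discard upper bounds, close / far
}
assert t["first"] < t["second"] < t["fourth"] and t["third"] < t["fourth"]
assert t["first"] < t["third"]                    # close-case thresholds are lower
assert t["fifth"] < t["sixth"] and t["seventh"] < t["eighth"]
assert t["fifth"] < t["seventh"] and t["sixth"] < t["eighth"]
assert t["seventh"] < t["first"] and t["eighth"] < t["third"]
```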
Fig. 3 shows a schematic flow chart of voiceprint matching provided in the embodiment of the present application. In this example, the position information of a sound-emitting object is its angle relative to the microphone array: the target position information is the target angle between the second sound-emitting object and the microphone array, and the first position information is the first angle corresponding to the first target audio segment.

Referring to fig. 3, when the matching degree between the target angle and the first angle is smaller than the first preset position matching degree threshold and larger than the second preset position matching degree threshold, i.e., the target angle is close to the first angle, the lower set of preset voiceprint matching degree thresholds is used for voiceprint comparison.
That is to say, when the preset sound database contains a first voiceprint feature whose matching degree with the target voiceprint feature is greater than the second preset voiceprint matching degree threshold, it is determined that the sound-emitting object corresponding to the first voiceprint feature is, with very high probability, the sound-emitting object of the second target audio segment, and the voiceprint feature corresponding to the target sound-emitting object is updated with the target voiceprint feature.

When the preset sound database contains a first voiceprint feature whose matching degree with the target voiceprint feature is greater than the first preset voiceprint matching degree threshold but smaller than the second preset voiceprint matching degree threshold, it is determined that the sound-emitting object corresponding to the first voiceprint feature is likely the sound-emitting object of the second target audio segment, though with lower probability than when the matching degree exceeds the second preset voiceprint matching degree threshold; therefore, the voiceprint feature corresponding to the target sound-emitting object is not updated with the target voiceprint feature.

If the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature is smaller than the fifth preset voiceprint matching degree threshold, it is determined that, with very high probability, no sound-emitting object in the preset sound database is the sound-emitting object corresponding to the target voiceprint feature; the correspondence between the target voiceprint feature and its sound-emitting object is then stored in the preset sound database, i.e., the target voiceprint feature and its sound-emitting object are registered.

If the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature is smaller than the seventh preset voiceprint matching degree threshold and larger than the fifth preset voiceprint matching degree threshold, it is determined that no sound-emitting object in the preset sound database is likely the sound-emitting object corresponding to the target voiceprint feature, though with lower confidence than when the matching degree is smaller than the fifth preset voiceprint matching degree threshold; the second target audio segment is therefore discarded, and the target voiceprint feature and its sound-emitting object are not registered.
Continuing with fig. 3, when the matching degree between the target angle and the first angle is smaller than the second preset position matching degree threshold, i.e., the target angle differs considerably from the first angle, the higher set of preset voiceprint matching degree thresholds is used for voiceprint comparison.

That is to say, when the preset sound database contains a first voiceprint feature whose matching degree with the target voiceprint feature is greater than the fourth preset voiceprint matching degree threshold, it is determined that the sound-emitting object corresponding to the first voiceprint feature is, with very high probability, the sound-emitting object of the second target audio segment, and the voiceprint feature corresponding to the target sound-emitting object is updated with the target voiceprint feature.

When the preset sound database contains a first voiceprint feature whose matching degree with the target voiceprint feature is greater than the third preset voiceprint matching degree threshold but smaller than the fourth preset voiceprint matching degree threshold, it is determined that the sound-emitting object corresponding to the first voiceprint feature is likely the sound-emitting object of the second target audio segment, though with lower probability than when the matching degree exceeds the fourth preset voiceprint matching degree threshold; therefore, the voiceprint feature corresponding to the target sound-emitting object is not updated with the target voiceprint feature.

If the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature is smaller than the sixth preset voiceprint matching degree threshold, it is determined that, with very high probability, no sound-emitting object in the preset sound database is the sound-emitting object corresponding to the target voiceprint feature; the correspondence between the target voiceprint feature and its sound-emitting object is then stored in the preset sound database, i.e., the target voiceprint feature and its sound-emitting object are registered.

If the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature is smaller than the eighth preset voiceprint matching degree threshold and larger than the sixth preset voiceprint matching degree threshold, it is determined that no sound-emitting object in the preset sound database is likely the sound-emitting object corresponding to the target voiceprint feature, though with lower confidence than when the matching degree is smaller than the sixth preset voiceprint matching degree threshold; the second target audio segment is therefore discarded, and the target voiceprint feature and its sound-emitting object are not registered.
In the embodiment of the application, performing voiceprint matching with two sets of preset voiceprint matching degree thresholds, selected according to the matching degree between the target position information and the first position information, improves the accuracy of sound-emitting object determination.
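The full flow of fig. 3 might be sketched as follows; this is a minimal sketch in which the return labels, threshold dictionary, and `best_score` input are our naming, with `best_score` standing for the highest matching degree found between the target voiceprint feature and the preset sound database:

```python
def voiceprint_decision(best_score, t, positions_close):
    """One pass of the fig. 3 decision flow; a sketch, not the claimed method.

    `positions_close` is True when the position matching degree fell between
    the second and first preset position matching degree thresholds.
    """
    if positions_close:
        match_lo, match_hi = t["first"], t["second"]
        register_hi, discard_hi = t["fifth"], t["seventh"]
    else:
        match_lo, match_hi = t["third"], t["fourth"]
        register_hi, discard_hi = t["sixth"], t["eighth"]

    if best_score > match_hi:
        return "match_and_update"      # same speaker; refresh stored voiceprint
    if best_score > match_lo:
        return "match_only"            # same speaker; keep stored voiceprint
    if best_score < register_hi:
        return "register_new_speaker"  # no enrolled speaker matches
    if best_score < discard_hi:
        return "discard_segment"       # too ambiguous to match or enroll
    return "undetermined"              # between discard and match bounds
```

In this layout the two branches of fig. 3 differ only in which four thresholds they read, mirroring the orderings summarized above.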
In some embodiments of the present application, in order to improve the accuracy of determining the sound generating object, before step 220, the sound generating object determining method provided by an embodiment of the present application further includes: filtering the position information of a sound production object corresponding to the audio frame in the first target audio segment to obtain filtered position information; based on the filtered location information, first location information is determined.
In some embodiments, the position information of the sounding object corresponding to each audio frame in the first target audio segment may be filtered by a median filter, so as to obtain filtered position information of the sounding object corresponding to each audio frame.
The idea of median filtering is that the position information of the sound-generating object corresponding to each audio frame can be replaced by a statistical median of the position information of the sound-generating objects corresponding to all audio frames in a neighborhood of a preset size of the audio frame.
As an example, if the position information of the sound generating object is an angle between the sound source and the microphone array, the first position information is an average value of the filtered angles of the sound generating objects corresponding to each audio frame.
In the embodiment of the application, filtering the position information of the sound-emitting object corresponding to each audio frame in the first target audio segment removes noise and spurious spikes, yielding smoother and more stable position information and thereby improving the accuracy of sound-emitting object determination.
In other embodiments of the present application, the position information of the sound-generating object corresponding to each audio frame in the first target audio segment may be filtered in other manners, for example, an averaging filter may be used.
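As a self-contained sketch of the median-filtering step described above, assuming per-frame angles and a window size of our choosing (`scipy.signal.medfilt` would serve equally well):

```python
import numpy as np

def first_position_from_angles(angles, window=5):
    """Median-filter the per-frame angles of the first target audio segment,
    then average the filtered angles to obtain the first position information.
    """
    angles = np.asarray(angles, dtype=float)
    half = window // 2
    padded = np.pad(angles, half, mode="edge")  # replicate boundary frames
    filtered = np.array([np.median(padded[i:i + window])
                         for i in range(len(angles))])
    return float(filtered.mean())
```

For instance, `first_position_from_angles([30, 31, 85, 30, 29])` suppresses the 85-degree outlier before averaging, returning roughly 29.8 rather than a value pulled toward the spike.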
Fig. 4 is a flowchart illustrating a method for determining an utterance object according to a second aspect of the present application. As shown in fig. 4, the second aspect of the present application provides a sound emission target determination method 400 including:
step 410, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
step 420, if it is determined that the target position information does not match first position information corresponding to a first target audio segment, extracting a target voiceprint feature of the first target audio segment, where the first target audio segment includes all consecutive audio frames emitted by a first sound-emitting object, the first sound-emitting object is the sound-emitting object of the audio frame preceding the target audio frame, and the end point of the first target audio segment is the audio frame preceding the target audio frame;
step 430, determining a target sound-generating object of the first target audio segment according to the target voiceprint characteristics.
In the embodiment of the present application, the specific implementation manner of step 410 is similar to that of step 210, and is not described herein again.
In the embodiment of the present application, the specific implementation manner of step 420 is similar to that of step 220. Step 420 differs from step 220 in that the first target audio segment includes all consecutive audio frames emitted by the first sound-emitting object, the first sound-emitting object is the sound-emitting object of the audio frame preceding the target audio frame, and the end point of the first target audio segment is the audio frame preceding the target audio frame.

In step 220, by contrast, the first target audio segment comprises the first N audio frames emitted by the first sound-emitting object before the target audio frame, i.e., not necessarily all consecutive audio frames emitted by the first sound-emitting object.
In the embodiment of the application, since the position information included in the first position information corresponding to all the continuous audio frames emitted by the first sound-emitting object is richer, the first position information can more accurately represent the position information of the first sound-emitting object. Therefore, the first position information is matched with the target position information, whether the sounding object is switched or not can be judged more accurately, and the accuracy of role separation can be improved.
In the embodiment of the present application, a specific implementation manner of step 430 is similar to that of step 230, and the identity of the first sound-emitting object may be determined according to the target voiceprint feature of the first target audio segment, which is not described herein again.
In an embodiment of the present invention, in a case where it is determined that the target position information of the second sound-emitting object emitting the target audio frame does not match the first position information corresponding to the first target audio segment, it may be determined that the first sound-emitting object emitting the first target audio segment and the second sound-emitting object emitting the target audio frame are different. The target voiceprint feature of the first target audio segment is then extracted, and the target sound-emitting object of the first target audio segment is determined according to the target voiceprint feature, i.e., the identity of the first sound-emitting object is determined, so as to realize role separation. By combining the voiceprint recognition technology and the sound source localization technology, the accuracy of detecting a switch of the sound-emitting object can be improved, thereby improving the accuracy of determining the sound-emitting object.
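A sketch of the switch test in this variant, under the assumption that position information is an angle and that a fixed angle difference stands in for the unspecified position matching degree:

```python
def is_speaker_switch(target_angle, segment_angles, max_angle_diff=10.0):
    """Decide whether the target audio frame marks a switch of the
    sound-emitting object.

    `segment_angles` holds the (filtered) angles of all consecutive audio
    frames emitted by the first sound-emitting object, so their mean is a
    comparatively rich estimate of the first position information.
    """
    first_position = sum(segment_angles) / len(segment_angles)
    return abs(target_angle - first_position) > max_angle_diff
```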
In an embodiment of the present application, a specific implementation of the method for determining a sounding object provided in the second aspect is similar to that of the method for determining a sounding object provided in the first aspect, and is not described herein again.
In the embodiment of the present application, to implement role separation, it is necessary to determine the starting point and the end point of the utterance content of each sound-emitting object so that the utterance content of each sound-emitting object can be separated. Fig. 5 is a flowchart illustrating a method for determining a starting point of utterance content according to a third aspect of the present application. As shown in fig. 5, a method 500 for determining a starting point of utterance content provided in a third aspect of the present application includes:
step 510, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
step 520, determining that the target position information is not matched with the first position information corresponding to the first target audio segment, and determining the target audio frame as a starting point of the sound production content of the second sound production object;
the first target audio segment comprises all continuous audio frames emitted by a first sound-emitting object, and the first sound-emitting object is a sound-emitting object of an audio frame before the target audio frame is emitted; the end point of the first target audio segment is an audio frame preceding the target audio frame.
In the embodiment of the present application, the specific implementation manner of step 510 is similar to that of step 210, and is not described herein again.
In step 520, in the case that it is determined that the target position information does not match the first position information corresponding to the first target audio segment, it is determined that the sound-emitting object has switched, i.e., from the first sound-emitting object to the second sound-emitting object. Therefore, the audio frame preceding the target audio frame may be used as the end point of the utterance content of the first sound-emitting object, and the target audio frame may be used as the starting point of the utterance content of the second sound-emitting object, so that the complete utterance content of the second sound-emitting object can be extracted subsequently.
In the embodiment of the application, whether the sound-producing object is switched or not can be determined by matching the target position information of the sound-producing object of the target audio frame with the first position information corresponding to all the continuous audio frames of the first sound-producing object, so that the starting point and the end point of the sound-producing content of each sound-producing object can be determined, and the role separation is realized.
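Under the same assumptions, the starting and end points of each utterance might be recovered with a loop like the following (frame layout and names are ours):

```python
def segment_by_position(frame_angles, is_switch):
    """Split a stream of per-frame angles into per-speaker utterances,
    returning (start_index, end_index) pairs.

    `is_switch` is any predicate such as `is_speaker_switch` above.
    """
    segments, start, current = [], 0, []
    for i, angle in enumerate(frame_angles):
        if current and is_switch(angle, current):
            segments.append((start, i - 1))  # previous frame ends the utterance
            start, current = i, []           # target frame starts the next one
        current.append(angle)
    if current:
        segments.append((start, len(frame_angles) - 1))
    return segments
```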
In some scenarios, different sound-emitting objects may emit sound; for example, in a meeting room, different participants may speak. To improve the efficiency of the conference, the current sound-emitting object can be indicated so that other users know its identity. It is therefore desirable to provide a method for changing the identification of a sound-emitting object to prompt the identity of the current sound-emitting object. Fig. 6 is a flowchart illustrating a method for changing an identification of a sound-emitting object according to a fourth aspect of the present application. As shown in fig. 6, a sound-emitting object identification changing method 600 according to a fourth aspect of the present application includes:
step 610, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
step 620, if it is determined that the target position information does not match the first position information corresponding to the first target audio segment, extracting target voiceprint characteristics of a second target audio segment, wherein the first target audio segment includes the first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
step 630, determining a target sound-generating object of the second target audio segment according to the target voiceprint characteristics;
and step 640, changing the identification of the second sound-emitting object and the identification of the target sound-emitting object, wherein the identifications are used for representing the sound-emitting state of the sound-emitting objects.
In the embodiment of the present application, a specific implementation manner of step 610 is similar to a specific implementation manner of step 210, a specific implementation manner of step 620 is similar to a specific implementation manner of step 220, and a specific implementation manner of step 630 is similar to a specific implementation manner of step 230, and is not described herein again.
In an embodiment of the present application, in a case where it is determined that the target position information does not match the first position information corresponding to the first target audio piece, it may be determined that the sound emission object is switched. That is, the sound-generating object is switched from the target sound-generating object of the second target audio segment to the second sound-generating object of the target audio frame.
Thus, the identification of the second sound-emitting object and the identification of the target sound-emitting object may be changed to prompt that the sound-emitting object has switched from the target sound-emitting object to the second sound-emitting object. The identification of a sound-emitting object is used for representing its sound-emitting state.
As one example, the identification of a sound-emitting object may be the brightness of its image. When the brightness of the image is a first preset brightness, it identifies that the sound-emitting object is currently in a sound-emitting state; when the brightness is a second preset brightness, it identifies that the sound-emitting object is currently in a non-sound-emitting state.

In the embodiment of the application, if the current sound-emitting object is switched from the target sound-emitting object to the second sound-emitting object, the brightness of the image of the target sound-emitting object is changed from the first preset brightness to the second preset brightness, representing that the target sound-emitting object stops emitting sound, and the brightness of the image of the second sound-emitting object is changed from the second preset brightness to the first preset brightness, representing that the second sound-emitting object starts to emit sound.

In other embodiments of the present application, the identification of a sound-emitting object may be a label of the sound-emitting object. When the label is a first preset label, it identifies that the sound-emitting object is currently in a sound-emitting state; when the label is a second preset label, it identifies that the sound-emitting object is currently in a non-sound-emitting state.

In the embodiment of the application, if the current sound-emitting object is switched from the target sound-emitting object to the second sound-emitting object, the label of the target sound-emitting object is changed from the first preset label to the second preset label, representing that the target sound-emitting object stops emitting sound, and the label of the second sound-emitting object is changed from the second preset label to the first preset label, representing that the second sound-emitting object starts to emit sound.
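A minimal sketch of such identification changes, with the state values and wiring as our assumptions rather than anything the application prescribes:

```python
class SpeakerIndicator:
    """Track per-speaker identifications that represent sound-emitting state."""

    SPEAKING = "first_preset"   # e.g., first preset brightness or label
    SILENT = "second_preset"    # e.g., second preset brightness or label

    def __init__(self, speakers):
        self.state = {s: self.SILENT for s in speakers}

    def on_switch(self, target_object, second_object):
        # The target sound-emitting object stops emitting sound.
        self.state[target_object] = self.SILENT
        # The second sound-emitting object starts emitting sound.
        self.state[second_object] = self.SPEAKING
```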
In an embodiment of the present application, by using the target position information of the sound-generating object of the target audio frame to match with the first position information corresponding to the first target audio segment, it may be determined whether the second sound-generating object is the same as the sound-generating object of the second target audio segment, i.e., whether a switch occurs in the current sound-generating object. And under the condition that the switching of the sound-emitting objects is determined, the identification of the second sound-emitting object and the identification of the target sound-emitting object are changed, so that the identity of the current sound-emitting object can be prompted.
In some conversation scenarios, after acquiring the audio conversation data in the conversation scenario, the audio conversation data needs to be processed to obtain a conversation record, so as to record the content of the conversation. Accordingly, the present application provides a session record generation method. Fig. 7 is a flowchart illustrating a session record generation method according to a fifth aspect of the present application. As shown in fig. 7, a session record generating method 700 provided by the fifth aspect of the present application includes:
step 710, acquiring a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object in the audio session data;
step 720, if the target position information is determined not to match the first position information corresponding to the first target audio segment in the audio session data, extracting the target voiceprint characteristics of the second target audio segment in the audio session data, wherein the first target audio segment comprises the first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
step 730, determining a target sound production object of a second target audio segment according to the target voiceprint characteristics;
step 740, associating the target sound-emitting object with the text content corresponding to the second target audio segment to obtain a session record of the target sound-emitting object.
In the embodiment of the present application, a specific implementation manner of step 710 is similar to that of step 210, a specific implementation manner of step 720 is similar to that of step 220, and a specific implementation manner of step 730 is similar to that of step 230, and therefore, details are not described herein again.
It should be noted that an audio collector may be used to collect the audio session data in the session scene. To facilitate generation of the session record, the position information of the sound-emitting object of each audio frame in the audio session data is also acquired during collection.
In an embodiment of the application, when it is determined that the target position information does not match the first position information corresponding to the first target audio segment in the audio session data, it is determined that the utterance object of the target audio frame is different from the utterance object of the first target audio segment, and the utterance content of the utterance object of the first target audio segment needs to be extracted. Due to the switching of the utterance object, the target audio frame may be used as a start point of the utterance content of the second utterance object, and a frame preceding the target audio frame may be used as an end point of the utterance content of the utterance object of the first target audio segment. With regard to the relationship between the first target audio segment and the second target audio segment, reference may be made to the statements of embodiments of the utterance object determination method provided in the first aspect.
After the target sound-emitting object of the second target audio segment is determined based on the target voiceprint feature, the text content corresponding to the second target audio segment is associated with the target sound-emitting object, thereby obtaining the session record of the target sound-emitting object.

In the embodiment of the application, the second target audio segment corresponding to each sound-emitting object in the audio session data can be extracted by the above method, so that the session record of the audio session data can be obtained.

To improve the completeness of the session record, the second target audio segment may include all consecutive audio frames emitted by the first sound-emitting object, the first sound-emitting object being the sound-emitting object of the audio frame preceding the target audio frame.

In the embodiment of the application, when it is determined that the target position information does not match the first position information corresponding to the first target audio segment in the audio session data, it is determined that the sound-emitting object of the target audio frame differs from that of the first target audio segment. The text content corresponding to the second target audio segment, whose sound-emitting object is the same as that of the first target audio segment, can therefore be associated with the target sound-emitting object to form the session record, making it convenient to retrieve the record of the audio session data later.
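A sketch of the association step, with `recognize_text` standing in for whichever speech-to-text call produces the text content (the application does not specify one):

```python
def build_session_record(speaker_segments, recognize_text):
    """Associate each target sound-emitting object with the text content of
    its audio segment, yielding one session-record entry per segment.

    `speaker_segments` is assumed to be a list of (speaker, audio_segment)
    pairs produced by the diarization steps above.
    """
    return [{"speaker": speaker, "text": recognize_text(audio)}
            for speaker, audio in speaker_segments]
```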
In an embodiment of the present application, an execution subject of the sound emission target determination method provided in the embodiment of the present application may be a sound emission target determination device. In the embodiments of the present application, the sound-generating object specifying device provided in the embodiments of the present application will be described by taking the sound-generating object specifying method executed by the sound-generating object specifying device as an example.
Fig. 8 is a schematic structural diagram of the sound emission target determination apparatus provided in the sixth aspect. As shown in fig. 8, the sound emission target determination device 800 includes:
an obtaining module 810, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
an extraction module 820, configured to extract a target voiceprint feature of a second target audio segment if it is determined that the target position information does not match first position information corresponding to a first target audio segment, where the first target audio segment includes first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment; n is an integer greater than or equal to 1;
the first determining module 830 is configured to determine a target sound-generating object of the second target audio segment according to the target voiceprint feature.
According to the embodiment of the present invention, in the case where it is determined that the target position information of the second sound-emitting object emitting the target audio frame does not match the first position information corresponding to the first target audio segment, it may be determined that the first sound-emitting object emitting the first target audio segment is different from the second sound-emitting object emitting the target audio frame. The target voiceprint feature of the second target audio segment emitted by the first sound-emitting object is then extracted, and the target sound-emitting object of the second target audio segment is determined according to the target voiceprint feature, i.e., the identity of the first sound-emitting object is determined, so as to realize role separation. By combining the voiceprint recognition technology and the sound source localization technology, the accuracy of detecting a switch of the sound-emitting object can be improved, thereby improving the accuracy of determining the sound-emitting object.
In some embodiments of the present application, the target position information includes relative position information between the second sound-emitting object and a preset audio collector.
In some embodiments of the present application, the sound emission target determination device 800 further includes:
the filtering module is used for filtering the position information of the sounding object corresponding to the audio frame in the first target audio segment to obtain the filtered position information;
a second determination module to determine the first location information based on the filtered location information.
In some embodiments of the present application, the first determining module 830 is configured to:
under the condition that a first voiceprint feature with the matching degree meeting a first preset matching condition exists in a preset sound database, determining a sound production object corresponding to the first voiceprint feature as a target sound production object of a second target audio segment;
under the condition that the matching degree of the first voiceprint features and the target voiceprint features meets a second preset matching condition, the corresponding voiceprint features of the target sound production object in a preset sound database are updated by using the target voiceprint features;
and the matching degree which needs to be met corresponding to the second preset matching condition is greater than the matching degree which needs to be met corresponding to the first preset matching condition.
In some embodiments of the present application, in a case that a matching degree between the target location information and the first location information is smaller than a first preset location matching degree threshold and larger than a second preset location matching degree threshold, the first preset matching condition includes that a matching degree of a voiceprint feature in the preset sound database and the target voiceprint feature is larger than a first preset voiceprint matching degree threshold; the second preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a second preset voiceprint matching degree threshold, wherein the second preset voiceprint matching degree threshold is larger than the first preset voiceprint matching degree threshold.
Under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the first preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a third preset voiceprint matching degree threshold value; the second preset matching condition comprises that the matching degree of the voiceprint features in the preset sound database and the target voiceprint features is larger than a fourth preset voiceprint matching degree threshold.
The fourth preset voiceprint matching degree threshold is larger than the third preset voiceprint matching degree threshold, the first preset voiceprint matching degree threshold is smaller than the second preset voiceprint matching degree threshold, and the second preset voiceprint matching degree threshold is smaller than the fourth preset voiceprint matching degree threshold.
In some embodiments of the present application, the sound-emitting object determination device 800 further includes:

a processing module, configured to, in a case that the matching degree between each voiceprint feature in the preset sound database and the target voiceprint feature satisfies a third preset matching condition, store in the preset sound database the correspondence between the target voiceprint feature and the sound-emitting object corresponding to the target voiceprint feature, and determine the sound-emitting object corresponding to the target voiceprint feature as the target sound-emitting object of the second target audio segment;
and the third preset matching condition is used for representing that the voiceprint features in the preset sound database are not matched with the target voiceprint features.
In some embodiments of the present application, in a case that a matching degree between the target location information and the first location information is smaller than a first preset location matching degree threshold and larger than a second preset location matching degree threshold, the third preset matching condition includes that a matching degree of a voiceprint feature in the preset sound database and the target voiceprint feature is smaller than a fifth preset voiceprint matching degree threshold.
And under the condition that the matching degree between the target position information and the first position information is smaller than a second preset position matching degree threshold value, the third preset matching condition comprises that the matching degree between the voiceprint features in the preset sound database and the target voiceprint features is smaller than a sixth preset voiceprint matching degree threshold value.
And the fifth preset voiceprint matching degree threshold is smaller than the sixth preset voiceprint matching degree threshold.
Other details of the apparatus 800 for determining a sound generating object according to the embodiment of the present invention are similar to those of the method for determining a sound generating object provided in the first aspect, and are not described herein again.
Fig. 9 is a schematic structural diagram of a sound emission target determination device according to a seventh aspect. As shown in fig. 9, the sound emission target determination device 900 includes:
an obtaining module 910, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
an extracting module 920, configured to determine that the target position information does not match first position information corresponding to a first target audio segment, and extract a target voiceprint feature of the first target audio segment, where the first target audio segment includes all consecutive audio frames emitted by a first sound-emitting object, and the first sound-emitting object is a sound-emitting object of a previous audio frame from which the target audio frame is emitted; the end point of the first target audio segment is a previous audio frame of the target audio frame;
a first determining module 930 configured to determine a target sound-generating object of the first target audio segment according to the target voiceprint characteristics.
In an embodiment of the present invention, in a case where it is determined that the target position information of the second sound-emitting object emitting the target audio frame does not match the first position information corresponding to the first target audio segment, it may be determined that the first sound-emitting object emitting the first target audio segment and the second sound-emitting object emitting the target audio frame are different. The target voiceprint feature of the first target audio segment is then extracted, and the target sound-emitting object of the first target audio segment is determined according to the target voiceprint feature, i.e., the identity of the first sound-emitting object is determined, so as to realize role separation. By combining the voiceprint recognition technology and the sound source localization technology, the accuracy of detecting a switch of the sound-emitting object can be improved, thereby improving the accuracy of determining the sound-emitting object.
Other details of the apparatus 900 for determining a sound generating object according to the embodiment of the present invention are similar to those of the method for determining a sound generating object provided in the second aspect, and are not described herein again.
Fig. 10 is a schematic structural diagram of a spoken content origin determining apparatus provided in the eighth aspect. As shown in fig. 10, the apparatus 1000 for determining the starting point of utterance content includes:
an obtaining module 1010, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
a first determining module 1020, configured to determine that the target position information does not match the first position information corresponding to the first target audio segment, and determine the target audio frame as a starting point of the utterance content of the second utterance object;
the first target audio segment comprises all consecutive audio frames emitted by a first sound-emitting object, the first sound-emitting object being the sound-emitting object of the audio frame preceding the target audio frame; the end point of the first target audio segment is the audio frame preceding the target audio frame.
In the embodiment of the application, whether the sound-producing object is switched or not can be determined by matching the target position information of the sound-producing object of the target audio frame with the first position information corresponding to all the continuous audio frames of the first sound-producing object, so that the starting point and the end point of the sound-producing content of each sound-producing object can be determined, and the role separation is realized.
Other details of the utterance content starting point determination apparatus 1000 according to the embodiment of the present invention are similar to the method for determining a starting point of utterance content provided in the third aspect above, and are not described herein again.
Fig. 11 is a schematic structural diagram of a sound object identifier changing apparatus provided in the ninth aspect. As shown in fig. 11, the sound generation target identification changing apparatus 1100 includes:
an obtaining module 1110, configured to obtain a target audio frame emitted by a second sound-emitting object and target position information of the second sound-emitting object;
an extraction module 1120, configured to extract a target voiceprint feature of a second target audio segment if it is determined that the target location information does not match first location information corresponding to a first target audio segment, where the first target audio segment includes a first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
a first determining module 1130, configured to determine a target sound-generating object of the second target audio segment according to the target voiceprint feature;
and an altering module 1140, configured to change the identification of the second sound-emitting object and the identification of the target sound-emitting object, where the identifications are used for representing the sound-emitting state of the sound-emitting objects.
In an embodiment of the present application, by using the target position information of the sound-generating object of the target audio frame to match with the first position information corresponding to the first target audio segment, it may be determined whether the second sound-generating object is the same as the sound-generating object of the second target audio segment, i.e., whether a switch occurs in the current sound-generating object. And under the condition that the switching of the sound-emitting objects is determined, the identification of the second sound-emitting object and the identification of the target sound-emitting object are changed, so that the identity of the current sound-emitting object can be prompted.
Other details of the apparatus 1100 for changing an identification of a sound generating object according to the embodiment of the present invention are similar to those of the method for changing an identification of a sound generating object provided in the fourth aspect, and are not described herein again.
Fig. 12 is a schematic structural diagram of a session record generation apparatus according to the tenth aspect. As shown in fig. 12, the session record generation apparatus 1200 includes:
an obtaining module 1210, configured to obtain a target audio frame emitted by a second sound-generating object in the audio session data and target position information of the second sound-generating object;
an extracting module 1220, configured to determine that the target position information does not match the first position information corresponding to the first target audio segment in the audio session data, and extract a target voiceprint feature of a second target audio segment in the audio session data, where the first target audio segment includes the first N audio frames of the target audio frame; the sound production object of the audio frame in the first target audio segment is the same as the sound production object of the audio frame in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment;
a first determining module 1230, configured to determine a target sound generation object of the second target audio segment according to the target voiceprint feature;
and the association module 1240 is configured to associate the target sound-generating object with the text content corresponding to the second target audio segment, so as to obtain a session record of the target sound-generating object.
In the embodiment of the application, when it is determined that the target position information does not match the first position information corresponding to the first target audio segment in the audio session data, it is determined that the sound-emitting object of the target audio frame differs from that of the first target audio segment. The text content corresponding to the second target audio segment, whose sound-emitting object is the same as that of the first target audio segment, can therefore be associated with the target sound-emitting object to form the session record, making it convenient to retrieve the record of the audio session data later.
Other details of the session record generating apparatus 1200 according to the embodiment of the present invention are similar to the session record generation method provided in the fifth aspect above, and are not described herein again.
The methods provided by any of the first to fifth aspects and the apparatuses provided by any of the sixth to tenth aspects, described in conjunction with Figs. 2 to 12, may be implemented by a computing device. Fig. 13 is a hardware configuration diagram of a computing device 1300 according to an embodiment of the invention.
As shown in Fig. 13, computing device 1300 includes an input device 1301, an input interface 1302, a processor 1303, a memory 1304, an output interface 1305, and an output device 1306. The input interface 1302, the processor 1303, the memory 1304, and the output interface 1305 are connected to each other via a bus 1310, and the input device 1301 and the output device 1306 are connected to the bus 1310 via the input interface 1302 and the output interface 1305, respectively, and thereby to the other components of the computing device 1300.
Specifically, the input device 1301 receives input information from the outside and transmits the input information to the processor 1303 through the input interface 1302; the processor 1303 processes input information based on computer-executable instructions stored in the memory 1304 to generate output information, stores the output information in the memory 1304 temporarily or permanently, and then transmits the output information to the output device 1306 through the output interface 1305; output device 1306 outputs output information to the exterior of computing device 1300 for use by a user.
The processor 1303 may include a Central Processing Unit (CPU), a Network Processor (NPU), a Tensor Processing Unit (TPU), a Field Programmable Gate Array (FPGA) chip, an Artificial Intelligence (AI) chip, and the like; this list is merely exemplary, and the processor is not limited to the types enumerated here.
That is, the computing device shown in Fig. 13 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, may implement any embodiment of any of the first to tenth aspects.
An embodiment of the invention further provides a computer storage medium having computer program instructions stored thereon; when executed by a processor, the computer program instructions implement the sound-generating object determination method provided by the embodiments of the present invention.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, a functional block may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the steps described above; that is, the steps may be performed in the order mentioned in the embodiments, in a different order, or simultaneously.
As will be apparent to those skilled in the art, for convenience and brevity of description, the specific working processes of the systems, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not described here again. It should be understood that the scope of the present invention is not limited thereto; any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present invention, and these modifications or substitutions shall fall within the scope of the present invention.

Claims (18)

1. A method for determining a sound-generating object, comprising:
acquiring a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
upon determining that the target position information does not match first position information corresponding to a first target audio segment, extracting a target voiceprint feature of a second target audio segment, wherein the first target audio segment comprises the N audio frames preceding the target audio frame; the sound-generating object of the audio frames in the first target audio segment is the same as the sound-generating object of the audio frames in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment; and N is an integer greater than or equal to 1; and
determining a target sound-generating object of the second target audio segment according to the target voiceprint feature.

2. The method according to claim 1, wherein the target position information comprises relative position information between the second sound-generating object and a preset audio collector.

3. The method according to claim 1, wherein before extracting the target voiceprint feature of the second target audio segment upon determining that the target position information does not match the first position information corresponding to the first target audio segment, the method further comprises:
filtering position information of the sound-generating objects corresponding to the audio frames in the first target audio segment to obtain filtered position information; and
determining the first position information based on the filtered position information.

4. The method according to claim 1, wherein determining the target sound-generating object of the second target audio segment according to the target voiceprint feature comprises:
when a first voiceprint feature whose matching degree with the target voiceprint feature satisfies a first preset matching condition exists in a preset voice database, determining the sound-generating object corresponding to the first voiceprint feature as the target sound-generating object of the second target audio segment; and
when the matching degree between the first voiceprint feature and the target voiceprint feature satisfies a second preset matching condition, updating, with the target voiceprint feature, the voiceprint feature corresponding to the target sound-generating object in the preset voice database;
wherein the matching degree required to satisfy the second preset matching condition is greater than the matching degree required to satisfy the first preset matching condition.

5. The method according to claim 4, wherein when the matching degree between the target position information and the first position information is less than a first preset position matching degree threshold and greater than a second preset position matching degree threshold, the first preset matching condition comprises that the matching degree between a voiceprint feature in the preset voice database and the target voiceprint feature is greater than a first preset voiceprint matching degree threshold, and the second preset matching condition comprises that the matching degree between a voiceprint feature in the preset voice database and the target voiceprint feature is greater than a second preset voiceprint matching degree threshold, the second preset voiceprint matching degree threshold being greater than the first preset voiceprint matching degree threshold;
when the matching degree between the target position information and the first position information is less than the second preset position matching degree threshold, the first preset matching condition comprises that the matching degree between a voiceprint feature in the preset voice database and the target voiceprint feature is greater than a third preset voiceprint matching degree threshold, and the second preset matching condition comprises that the matching degree between a voiceprint feature in the preset voice database and the target voiceprint feature is greater than a fourth preset voiceprint matching degree threshold;
wherein the fourth preset voiceprint matching degree threshold is greater than the third preset voiceprint matching degree threshold, the first preset voiceprint matching degree threshold is less than the second preset voiceprint matching degree threshold, and the second preset voiceprint matching degree threshold is less than the fourth preset voiceprint matching degree threshold.

6. The method according to claim 4, further comprising:
when the matching degree between each voiceprint feature in the preset voice database and the target voiceprint feature satisfies a third preset matching condition, storing in the preset voice database the target voiceprint feature and the correspondence between the target voiceprint feature and its sound-generating object, and determining the sound-generating object corresponding to the target voiceprint feature as the target sound-generating object of the second target audio segment;
wherein the third preset matching condition is used to indicate that the voiceprint features in the preset voice database do not match the target voiceprint feature.

7. The method according to claim 6, wherein when the matching degree between the target position information and the first position information is less than a first preset position matching degree threshold and greater than a second preset position matching degree threshold, the third preset matching condition comprises that the matching degree between a voiceprint feature in the preset voice database and the target voiceprint feature is less than a fifth preset voiceprint matching degree threshold;
when the matching degree between the target position information and the first position information is less than the second preset position matching degree threshold, the third preset matching condition comprises that the matching degree between a voiceprint feature in the preset voice database and the target voiceprint feature is less than a sixth preset voiceprint matching degree threshold;
wherein the fifth preset voiceprint matching degree threshold is less than the sixth preset voiceprint matching degree threshold.

8. A method for determining a sound-generating object, comprising:
acquiring a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
upon determining that the target position information does not match first position information corresponding to a first target audio segment, extracting a target voiceprint feature of the first target audio segment, wherein the first target audio segment comprises all consecutive audio frames emitted by a first sound-generating object, the first sound-generating object being the sound-generating object that emitted the audio frame immediately preceding the target audio frame, and the end point of the first target audio segment being the audio frame immediately preceding the target audio frame; and
determining a target sound-generating object of the first target audio segment according to the target voiceprint feature.

9. A method for determining a starting point of utterance content, comprising:
acquiring a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object; and
upon determining that the target position information does not match first position information corresponding to a first target audio segment, determining the target audio frame as the starting point of the utterance content of the second sound-generating object;
wherein the first target audio segment comprises all consecutive audio frames emitted by a first sound-generating object, the first sound-generating object being the sound-generating object that emitted the audio frame immediately preceding the target audio frame, and the end point of the first target audio segment being the audio frame immediately preceding the target audio frame.

10. A method for changing an identification of a sound-generating object, comprising:
acquiring a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
upon determining that the target position information does not match first position information corresponding to a first target audio segment, extracting a target voiceprint feature of a second target audio segment, wherein the first target audio segment comprises the N audio frames preceding the target audio frame; the sound-generating object of the audio frames in the first target audio segment is the same as the sound-generating object of the audio frames in the second target audio segment; and the first target audio segment is at least a portion of the second target audio segment;
determining a target sound-generating object of the second target audio segment according to the target voiceprint feature; and
changing the identification of the second sound-generating object and the identification of the target sound-generating object, the identification being used to represent the utterance state of a sound-generating object.

11. A method for generating a session record, comprising:
acquiring a target audio frame emitted by a second sound-generating object in audio session data and target position information of the second sound-generating object;
upon determining that the target position information does not match first position information corresponding to a first target audio segment in the audio session data, extracting a target voiceprint feature of a second target audio segment in the audio session data, wherein the first target audio segment comprises the N audio frames preceding the target audio frame; the sound-generating object of the audio frames in the first target audio segment is the same as the sound-generating object of the audio frames in the second target audio segment; and the first target audio segment is at least a portion of the second target audio segment;
determining a target sound-generating object of the second target audio segment according to the target voiceprint feature; and
associating the target sound-generating object with the text content corresponding to the second target audio segment to obtain a session record of the target sound-generating object.

12. An apparatus for determining a sound-generating object, comprising:
an obtaining module, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
an extracting module, configured to, upon determining that the target position information does not match first position information corresponding to a first target audio segment, extract a target voiceprint feature of a second target audio segment, wherein the first target audio segment comprises the N audio frames preceding the target audio frame; the sound-generating object of the audio frames in the first target audio segment is the same as the sound-generating object of the audio frames in the second target audio segment; the first target audio segment is at least a portion of the second target audio segment; and N is an integer greater than or equal to 1; and
a first determining module, configured to determine a target sound-generating object of the second target audio segment according to the target voiceprint feature.

13. An apparatus for determining a sound-generating object, comprising:
an obtaining module, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
an extracting module, configured to, upon determining that the target position information does not match first position information corresponding to a first target audio segment, extract a target voiceprint feature of the first target audio segment, wherein the first target audio segment comprises all consecutive audio frames emitted by a first sound-generating object, the first sound-generating object being the sound-generating object that emitted the audio frame immediately preceding the target audio frame, and the end point of the first target audio segment being the audio frame immediately preceding the target audio frame; and
a first determining module, configured to determine a target sound-generating object of the first target audio segment according to the target voiceprint feature.

14. An apparatus for determining a starting point of utterance content, comprising:
an obtaining module, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object; and
a first determining module, configured to, upon determining that the target position information does not match first position information corresponding to a first target audio segment, determine the target audio frame as the starting point of the utterance content of the second sound-generating object;
wherein the first target audio segment comprises all consecutive audio frames emitted by a first sound-generating object, the first sound-generating object being the sound-generating object that emitted the audio frame immediately preceding the target audio frame, and the end point of the first target audio segment being the audio frame immediately preceding the target audio frame.

15. An apparatus for changing an identification of a sound-generating object, comprising:
an obtaining module, configured to obtain a target audio frame emitted by a second sound-generating object and target position information of the second sound-generating object;
an extracting module, configured to, upon determining that the target position information does not match first position information corresponding to a first target audio segment, extract a target voiceprint feature of a second target audio segment, wherein the first target audio segment comprises the N audio frames preceding the target audio frame; the sound-generating object of the audio frames in the first target audio segment is the same as the sound-generating object of the audio frames in the second target audio segment; and the first target audio segment is at least a portion of the second target audio segment;
a first determining module, configured to determine a target sound-generating object of the second target audio segment according to the target voiceprint feature; and
a changing module, configured to change the identification of the second sound-generating object and the identification of the target sound-generating object, the identification being used to represent the utterance state of a sound-generating object.

16. An apparatus for generating a session record, comprising:
an obtaining module, configured to obtain a target audio frame emitted by a second sound-generating object in audio session data and target position information of the second sound-generating object;
an extracting module, configured to, upon determining that the target position information does not match first position information corresponding to a first target audio segment in the audio session data, extract a target voiceprint feature of a second target audio segment in the audio session data, wherein the first target audio segment comprises the N audio frames preceding the target audio frame; the sound-generating object of the audio frames in the first target audio segment is the same as the sound-generating object of the audio frames in the second target audio segment; and the first target audio segment is at least a portion of the second target audio segment;
a first determining module, configured to determine a target sound-generating object of the second target audio segment according to the target voiceprint feature; and
an association module, configured to associate the target sound-generating object with the text content corresponding to the second target audio segment to obtain a session record of the target sound-generating object.

17. A computing device, comprising a processor and a memory storing computer program instructions, wherein the processor implements the method according to any one of claims 1 to 11 when executing the computer program instructions.

18. A computer storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 11.
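The tiered thresholds recited in claims 4 to 7 can be hard to follow in prose. The sketch below is one illustrative reading, not the patent's implementation; the numeric values are invented placeholders, and only their orderings (first < second < fourth, third < fourth, fifth < sixth) come from the claims:

```python
# Placeholder values invented for illustration; the claims only fix the orderings.
FIRST_VP, SECOND_VP = 0.60, 0.75   # used when the position match is ambiguous
THIRD_VP, FOURTH_VP = 0.70, 0.85   # used when the position clearly mismatches
FIFTH_VP, SIXTH_VP = 0.40, 0.50    # "no match" ceilings for enrolling a new speaker


def select_thresholds(position_match_degree, first_pos=0.8, second_pos=0.5):
    """Claims 5 and 7: pick voiceprint thresholds according to how badly the
    target position mismatches the first position information. A degree at or
    above `first_pos` would mean the positions match, so this function would
    not be reached in that case."""
    if second_pos < position_match_degree < first_pos:
        # Ambiguous position match.
        return {"identify": FIRST_VP, "update": SECOND_VP, "new": FIFTH_VP}
    # Clear mismatch: degree below the second position matching degree threshold.
    return {"identify": THIRD_VP, "update": FOURTH_VP, "new": SIXTH_VP}


def decide(best_score, thresholds):
    """Claims 4-7: apply the first, second, and third preset matching
    conditions to the best matching degree found in the preset voice database."""
    if best_score < thresholds["new"]:
        return "no match: enroll the target voiceprint as a new speaker"
    if best_score > thresholds["update"]:
        return "identify the speaker and refresh the stored voiceprint"
    if best_score > thresholds["identify"]:
        return "identify the speaker"
    return "undecided"
```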
CN202010736267.2A 2020-07-28 2020-07-28 Sound production object determination method and device, computing equipment and medium Pending CN114005451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010736267.2A CN114005451A (en) 2020-07-28 2020-07-28 Sound production object determination method and device, computing equipment and medium

Publications (1)

Publication Number Publication Date
CN114005451A 2022-02-01

Family

ID=79920283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010736267.2A Pending CN114005451A (en) 2020-07-28 2020-07-28 Sound production object determination method and device, computing equipment and medium

Country Status (1)

Country Link
CN (1) CN114005451A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968991A (en) * 2012-11-29 2013-03-13 华为技术有限公司 Method, device and system for sorting voice conference minutes
US20140348308A1 (en) * 2013-05-22 2014-11-27 Nuance Communications, Inc. Method And System For Speaker Verification
CN108924343A (en) * 2018-06-19 2018-11-30 Oppo广东移动通信有限公司 Electronic device control method, electronic device control device, storage medium and electronic device
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for distinguishing conference content
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110875053A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Method, apparatus, system, device and medium for speech processing

Similar Documents

Publication Publication Date Title
US12165653B2 (en) Matching speakers to meeting audio
TWI643184B (en) Method and apparatus for speaker diarization
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN108922538B (en) Conference information recording method, conference information recording device, computer equipment and storage medium
CN105161093B (en) A kind of method and system judging speaker's number
KR100879410B1 (en) Distributed Speech Recognition System Using Acoustic Feature Vector Correction
JP2017207770A (en) System and method for fingerprinting datasets
CN113744742B (en) Role identification method, device and system under dialogue scene
CN112053691B (en) Conference assisting method and device, electronic equipment and storage medium
WO2012175094A1 (en) Identification of a local speaker
JP5385876B2 (en) Speech segment detection method, speech recognition method, speech segment detection device, speech recognition device, program thereof, and recording medium
JP2016180839A (en) Noise suppression speech recognition apparatus and program thereof
CN112908336A (en) Role separation method for voice processing device and voice processing device thereof
WO2022126040A1 (en) User speech profile management
JP4973352B2 (en) Voice processing apparatus and program
JP5704071B2 (en) Audio data analysis apparatus, audio data analysis method, and audio data analysis program
CN115050372B (en) Audio fragment clustering method and device, electronic equipment and medium
CN113921026B (en) Voice enhancement method and device
KR20210150372A (en) Signal processing device, signal processing method and program
JP2017191531A (en) Communication system, server, and communication method
CN114005451A (en) Sound production object determination method and device, computing equipment and medium
CN114038487B (en) Audio extraction method, device, equipment and readable storage medium
US6934364B1 (en) Handset identifier using support vector machines
CN114005436A (en) Method, device and storage medium for determining voice endpoint
JP2013235050A (en) Information processing apparatus and method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination