
CN119943091A - Audio stream identification method, device, electronic device and computer storage medium - Google Patents


Info

Publication number
CN119943091A
CN119943091A (application CN202311452191.0A)
Authority
CN
China
Prior art keywords
audio stream
holographic
recognition model
audio
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311452191.0A
Other languages
Chinese (zh)
Inventor
张智欣
姚津
白金
林松
严锋贵
王东
姚永明
李鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311452191.0A
Publication of CN119943091A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract


The embodiment of the present application discloses a method for identifying an audio stream, which is applied to an electronic device, including: obtaining an audio stream of the electronic device, extracting features from the audio stream to obtain features of the audio stream, inputting the features of the audio stream into a trained scene recognition model to perform holographic scene recognition, and obtaining the holographic scene type of the audio stream. The embodiment of the present application also provides an audio stream recognition device, an electronic device, and a computer storage medium.

Description

Audio stream identification method and device, electronic equipment and computer storage medium
Technical Field
The present application relates to a playing technology of spatial audio in holographic audio, and in particular, to a method and apparatus for identifying an audio stream, an electronic device, and a computer storage medium.
Background
Currently, a mobile terminal sets the holographic scene type for an application program according to the software package name (package name) of the application where the audio stream type (Stream) resides. For example, when a music application program (Application, APP) plays sound, the holographic scene type is determined from the APP's software package name, and the spatial audio of the APP is played based on the determined holographic scene type.
However, because an application program can play many types of audio streams, determining the holographic scene type from the software package name alone yields poor accuracy. The conventional way of determining the holographic scene type under holographic audio therefore suffers from the technical problem of low accuracy.
Disclosure of Invention
The embodiment of the application provides an audio stream identification method, an audio stream identification device, electronic equipment and a computer storage medium, which can improve the accuracy of determining the type of a holographic scene under holographic audio.
The technical scheme of the application is realized as follows:
The embodiment of the application provides an audio stream identification method, which is applied to electronic equipment and comprises the following steps:
acquiring an audio stream of the electronic equipment;
performing feature extraction on the audio stream to obtain features of the audio stream;
and inputting the features of the audio stream into a trained scene recognition model for holographic scene recognition, so as to obtain the holographic scene type of the audio stream.
The embodiment of the application provides an audio stream identification method, which is applied to a cloud server and comprises the following steps:
Acquiring a sample training set, wherein the sample training set comprises an acquired audio stream and a holographic scene type of the acquired audio stream;
training the scene recognition model by using the sample training set to obtain a trained scene recognition model;
and sending the trained scene recognition model to the electronic equipment.
An embodiment of the present application provides an apparatus for identifying an audio stream, where the apparatus is disposed in an electronic device, and includes:
the first acquisition module is used for acquiring the audio stream of the electronic equipment;
The extraction module is used for extracting the characteristics of the audio stream to obtain the characteristics of the audio stream;
and the identification module is used for inputting the characteristics of the audio stream into the trained scene identification model to carry out holographic scene identification, so as to obtain the holographic scene type of the audio stream.
The embodiment of the application provides an audio stream identification device, which is arranged in a cloud server and comprises:
the first acquisition module is configured to acquire a sample training set, where the sample training set includes an acquired audio stream and the holographic scene type of the acquired audio stream;
The training module is used for training the scene recognition model by using the sample training set to obtain a trained scene recognition model;
And the sending module is used for sending the trained scene recognition model to the electronic equipment.
An embodiment of the present application provides an electronic device, including:
a processor and a storage medium storing instructions executable by the processor, the storage medium relying on the processor to perform operations through a communication bus; when the instructions are executed by the processor, the audio stream identification method described in one or more of the above embodiments is performed.
The embodiment of the application provides a cloud server, which comprises:
a processor and a storage medium storing instructions executable by the processor, the storage medium relying on the processor to perform operations through a communication bus; when the instructions are executed by the processor, the audio stream identification method described in one or more of the above embodiments is performed.
Embodiments of the present application provide a computer storage medium storing executable instructions that, when executed by one or more processors, perform a method of identifying an audio stream according to one or more embodiments.
The embodiments of the present application provide an audio stream identification method and apparatus, an electronic device, and a computer storage medium. The method is applied to the electronic device and includes: obtaining an audio stream of the electronic device, performing feature extraction on the audio stream to obtain features of the audio stream, and inputting the features into a trained scene recognition model for holographic scene recognition, so as to obtain the holographic scene type of the audio stream. That is, in the embodiments of the present application, the holographic scene type of an audio stream is obtained by feeding the audio stream of the electronic device into a trained scene recognition model for holographic scene recognition. Because the trained scene recognition model identifies the holographic scene type directly, the determined type is better suited to holographic audio, which improves the accuracy of determining the holographic scene type under holographic audio and, in turn, the sound effect of the holographic audio.
Drawings
Fig. 1 is a schematic flow chart of an alternative audio stream identification method according to an embodiment of the present application;
Fig. 2 is a second flowchart of an alternative audio stream identification method according to an embodiment of the present application;
fig. 3 is a flowchart illustrating an example one of an alternative audio stream identification method according to an embodiment of the present application;
Fig. 4 is a flowchart illustrating an example two of an alternative audio stream identification method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an alternative audio stream recognition device according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram II of an alternative audio stream recognition device according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an alternative cloud server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
In the related art, when determining the holographic scene type of an audio stream, dividing the holographic scene type according to the application type is often inappropriate: for example, a common music APP offers live broadcast, video, and music, and a common communication APP lets users watch videos on a social platform, watch videos in a chat interface with friends, listen to voice messages, make voice calls, and use a navigation function.
Secondly, it is often not appropriate to determine the holographic scene type based on the type of the audio stream being played, because the stream type is set by the application at playback time and its accuracy is poor.
Finally, if the holographic scene type is identified from the application's software package name, recognition only applies to known software packages; the limited list does not cover unknown applications, so accuracy is again poor.
In addition, holographic audio (also called holographic sound effect) is a 3D audio technology that simulates the mode of human auditory perception: through complex algorithms and processing, it renders sound stereoscopically from a single plane, so that a listener can clearly perceive the position and distance of the sound in space. On this basis, application programs are controlled under holographic audio.
In order to solve the technical problem of low accuracy in a manner of determining a holographic scene type under holographic audio in the related art, an embodiment of the present application provides an audio stream identification method, where the method is applied to an electronic device, fig. 1 is a schematic flow diagram of an alternative audio stream identification method provided by the embodiment of the present application, and as shown in fig. 1, the audio stream identification method may include:
s101, acquiring an audio stream of electronic equipment;
In S101, the electronic device first obtains the audio streams of the electronic device, where the number of the audio streams may be one, or may be two or more, and here, the embodiment of the present application is not limited in particular.
In addition, the audio stream of the electronic device may be one audio stream of one application, two or more audio streams of one application, and two or more audio streams of two applications, where the embodiment of the application is not limited in particular.
Specifically, the application program writes the audio stream to the memory of the electronic device through the AudioTrack interface, so that the electronic device obtains the audio stream of the electronic device.
S102, extracting characteristics of an audio stream to obtain the characteristics of the audio stream;
After S101, the electronic device performs feature extraction on the audio stream. Before feature extraction, the audio stream may first be processed in sequence by data cleaning, file format alignment, data enhancement, and spectrogram conversion to obtain a processed audio stream; it should be noted that the processing is not limited to these steps.
After the processed audio stream is obtained, feature extraction is performed on it by a filter-bank front end (FilterBank, Fbank) or an algorithm such as Mel-frequency cepstral coefficients (MFCC) to obtain the features of the audio stream.
The above feature extraction is performed on each audio stream, for example, when the audio stream is one audio stream, the feature of the audio stream is obtained by performing feature extraction on the audio stream, and when the audio stream is two or more audio streams, the feature of each audio stream is obtained by performing feature extraction on each audio stream.
In this way, the characteristics of the audio stream can be derived for determining the holographic scene type of the audio stream.
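The Fbank-style feature extraction described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; the 25 ms / 10 ms framing at 16 kHz, the 26 filters, and the 512-point FFT are common defaults assumed here for illustration.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def fbank_features(signal, sr=16000, n_fft=512):
    """Windowed power spectrum -> mel energies -> log, one row per frame."""
    frames = frame_signal(signal)
    spec = np.abs(np.fft.rfft(frames * np.hamming(frames.shape[1]), n=n_fft)) ** 2
    return np.log(spec @ mel_filterbank(n_fft=n_fft, sr=sr).T + 1e-10)

# One second of noise standing in for a captured audio stream.
feat = fbank_features(np.random.randn(16000))
```

Each row of `feat` is the log-mel energy vector for one frame; a real pipeline would apply the data cleaning and enhancement steps mentioned above before framing.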
S103, inputting the characteristics of the audio stream into the trained scene recognition model to perform holographic scene recognition, and obtaining the holographic scene type of the audio stream.
After determining the features of the audio stream through S102, since the trained scene recognition model is stored in the electronic device, the electronic device may input the features of the audio stream into the trained scene recognition model to perform holographic scene recognition, so as to obtain the holographic scene type of the audio stream.
The trained scene recognition model may be obtained by training the electronic device based on a sample data set of the model, or may be obtained by training the cloud server based on a sample data set of the model, and after the trained scene recognition model is obtained, the trained scene recognition model is sent to the electronic device, so that the trained scene recognition model is stored in the electronic device, where the embodiment of the application is not limited in detail.
When the number of the audio streams is one, the characteristics of the audio stream are input into the trained scene recognition model to obtain the holographic scene type of the audio stream, and when the number of the audio streams is two or more, the characteristics of each audio stream are input into the trained scene recognition model to obtain the holographic scene type of each audio stream.
In addition, the above holographic scene types may include game background sound, music, voice, video type one, video type two, system call, Voice over Internet Protocol (VoIP) voice, ring, alarm clock, notification, navigation, game voice, etc.; the embodiments of the present application are not specifically limited here.
Therefore, the holographic scene type of the audio stream can be obtained, and the trained scene recognition model is adopted to ensure that the determination of the holographic scene type is more accurate, and the audio effect under holographic audio is better.
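A minimal sketch of step S103 — feeding stream features to a classifier and reading off a holographic scene type. The single linear layer and its random weights are placeholders standing in for the trained scene recognition model (a real deployment would load trained parameters); the label strings follow the scene list above.

```python
import numpy as np

# Holographic scene labels from the description above (order is illustrative).
SCENES = ["game background sound", "music", "voice", "video type one",
          "video type two", "system call", "VoIP voice", "ring",
          "alarm clock", "notification", "navigation", "game voice"]

rng = np.random.default_rng(0)
# Placeholder for the trained scene recognition model: one linear layer.
W = rng.normal(size=(26, len(SCENES)))
b = np.zeros(len(SCENES))

def recognize(features):
    """Average per-frame features, score each scene, return the best label."""
    logits = features.mean(axis=0) @ W + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return SCENES[int(np.argmax(probs))], float(probs.max())

label, conf = recognize(rng.normal(size=(98, 26)))
```

With two or more streams, `recognize` would simply be called once per stream's feature matrix, matching the per-stream recognition described below.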
In general, in a scene of holographic audio, there are multiple audio streams to be played, and in order to implement identification of holographic scene types of the multiple audio streams, in an alternative embodiment, when the number of audio streams is at least two, S102 may include:
Extracting the characteristics of each audio stream in the audio streams to obtain the characteristics of each audio stream;
Accordingly, S103 may include:
and inputting the characteristics of each audio stream into the trained scene recognition model to respectively carry out holographic scene recognition to obtain the holographic scene type of each audio stream.
It can be understood that when the electronic device acquires at least two audio streams, it extracts the features of each audio stream, and then inputs the features of each audio stream into the trained scene recognition model for holographic scene recognition separately, thereby obtaining the holographic scene type of each audio stream.
That is, for two or more audio streams, feature extraction and scene recognition are performed respectively, so that the holographic scene type of each audio stream is determined respectively, which is helpful to improve the accuracy of determining the holographic scene type, so that the electronic device can know the holographic scene type of each audio stream, and play the spatial audio corresponding to the audio stream by utilizing the holographic audio better.
In order to enhance the sound effect of the audio stream under holographic audio, in an alternative embodiment, the method may further comprise:
Based on the holographic scene type of the audio stream, playing the spatial audio corresponding to the audio stream.
It will be appreciated that when the holographic scene type of the audio stream is obtained, the spatial audio corresponding to the audio stream may be played based on the determined type: the spatial position of the audio stream is determined mainly from its holographic scene type, and the electronic device then plays the spatial audio at the determined spatial position.
Therefore, the spatial audio corresponding to the audio stream is played based on the holographic scene type of the audio stream, and the spatial sound effect of the audio stream in the holographic scene can be improved.
In order to improve the spatial sound effect of the audio stream under the holographic audio when the number of the audio streams is one, in an alternative embodiment, playing the spatial audio corresponding to the audio stream based on the holographic scene type of the audio stream may include:
Determining a spatial position corresponding to the holographic scene type based on the holographic scene type of the audio stream;
And playing the spatial audio of the audio stream based on the spatial position corresponding to the holographic scene type.
It can be understood that, the electronic device determines the spatial position corresponding to the holographic scene type based on the holographic scene type of the audio stream, where the electronic device may be preset with a correspondence between the scene type and the spatial position, then determine the spatial position corresponding to the holographic scene type based on the correspondence, and play the spatial audio of the audio stream at the determined spatial position corresponding to the holographic scene type.
Alternatively, the central position of the space in which the holographic audio is located may be directly determined as the spatial position corresponding to the holographic scene type, and the spatial audio of the audio stream played at that central position; this is not specifically limited here.
In this way, the spatial audio of the audio stream can be played based on the holographic scene type of the audio stream, thereby improving the spatial sound effect under the holographic audio.
In addition, for improving spatial sound effects in the holographic scene when the number of audio streams is at least two, in an alternative embodiment, playing spatial audio corresponding to the audio streams based on the holographic scene type of the audio streams when the number of audio streams is at least two may include:
Determining a spatial position corresponding to each audio stream based on the priority of the holographic scene type of each audio stream in the audio streams;
and playing the spatial audio of each audio stream based on the corresponding spatial position of each audio stream.
It will be appreciated that the electronic device determines the priority of the holographic scene type of each audio stream. It should be noted that each holographic scene type corresponds to a priority, for example, in order from high to low: game background sound, music, voice, video type one, video type two, system call, VoIP voice, ring, alarm clock, notification, navigation, game voice.
After determining the priority of the holographic scene type of each audio stream, the electronic device may store in advance a correspondence between the priority and the spatial position, and may determine the spatial position of each audio stream based on the correspondence, or may determine the spatial position from high to low according to the priority, where embodiments of the present application are not limited in this particular manner.
After determining the spatial position corresponding to each audio stream, the electronic device may play the spatial audio of each audio stream at the spatial position corresponding to each audio stream.
In this way, the electronic device determines the spatial position corresponding to each audio stream according to the priority of the holographic scene type of each audio stream, and plays the spatial audio of each audio stream at its corresponding spatial position, which helps improve the spatial sound effect under holographic audio.
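The priority-to-position mapping just described might be sketched as follows. The azimuth slots and the handling of streams beyond the slot count are assumptions made for illustration; the priority order is the one listed in the text.

```python
# Scene types ordered from highest to lowest priority, per the description.
PRIORITY_ORDER = ["game background sound", "music", "voice", "video type one",
                  "video type two", "system call", "VoIP voice", "ring",
                  "alarm clock", "notification", "navigation", "game voice"]

# Hypothetical spatial slots: the highest-priority stream takes front-center,
# lower-priority streams move outward (azimuth in degrees).
SLOTS = [0, -45, 45, -90, 90, 180]

def assign_positions(streams):
    """Map each (stream_id, scene_type) pair to an azimuth by scene priority."""
    ordered = sorted(streams, key=lambda s: PRIORITY_ORDER.index(s[1]))
    return {sid: SLOTS[min(i, len(SLOTS) - 1)]
            for i, (sid, _) in enumerate(ordered)}

# Music outranks navigation, so the music stream takes front-center.
positions = assign_positions([("nav", "navigation"), ("bgm", "music")])
```

An electronic device could equally store an explicit priority-to-position table, as the text notes; the sorted-slot scheme here is just one concrete realization.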
In order to improve accuracy of holographic scene type identification, in an alternative embodiment, the method may further include:
when the trained scene recognition model continuously outputs the holographic scene types of the audio stream, and the continuous output times reach the preset times, the characteristics of the audio stream and the holographic scene types of the audio stream are sent to the cloud server.
It can be understood that after acquiring the audio stream, the electronic device re-acquires it at preset time intervals and identifies its holographic scene type in the manner described above. During model recognition, when the trained scene recognition model continuously outputs the same holographic scene type for the audio stream, and the number of consecutive identical outputs reaches the preset count, the recognition result can be considered highly accurate and suitable for retraining the model; the features of the audio stream and the holographic scene type of the audio stream are therefore sent to the cloud server.
The features of the audio stream and its holographic scene type are used by the cloud server to retrain its locally trained scene recognition model and obtain new parameters for the trained model, which are used to update that model; the new parameters can also be used by the electronic device to update its own locally stored trained scene recognition model.
Therefore, the parameters of the model are updated on the cloud server side and the electronic equipment side, and the accuracy of the type of the holographic scene identified by the model is improved.
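The "same result a preset number of consecutive times" gate above can be sketched as a small counter. The `preset_count=3` default and the reset-after-send behavior are assumptions for illustration.

```python
class StableResultReporter:
    """Flag (features, scene_type) for upload only after the model outputs
    the same scene type a preset number of consecutive times."""

    def __init__(self, preset_count=3):
        self.preset_count = preset_count
        self.last = None
        self.streak = 0

    def observe(self, scene_type):
        """Return True when the result is stable enough to send to the cloud."""
        if scene_type == self.last:
            self.streak += 1
        else:
            self.last, self.streak = scene_type, 1
        if self.streak >= self.preset_count:
            self.streak = 0  # reset so each upload requires a fresh streak
            return True
        return False

r = StableResultReporter(preset_count=3)
sends = [r.observe(s) for s in ["music", "music", "voice", "voice", "voice"]]
```

Only the third consecutive "voice" result triggers an upload; the interrupted "music" streak never does.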
In order to obtain a trained scene recognition model, in an alternative embodiment, the method may further include:
and acquiring the trained scene recognition model from the cloud server.
It can be understood that, in addition to the electronic device training the scene recognition model itself, the model can be trained on the cloud server. The electronic device then obtains the trained scene recognition model from the cloud server, and after acquiring an audio stream and extracting its features, performs holographic scene recognition on the audio stream to obtain its holographic scene type.
Therefore, the trained scene recognition model is trained through the cloud server, so that the recognition accuracy of the holographic scene type can be improved while the energy consumption of the electronic equipment is reduced.
In addition, in order to update parameters corresponding to the trained scene recognition model to improve accuracy of holographic scene recognition, in an alternative embodiment, the method may further include:
Acquiring new parameters of the trained scene recognition model from a cloud server;
and updating the parameters of the trained scene recognition model to the new parameters, so as to obtain an updated trained scene recognition model.
It can be understood that after the electronic device obtains the trained scene recognition model, new parameters of the local trained scene recognition model can be obtained from the cloud server, and the parameters of the trained scene recognition model are updated to the new parameters, so that the trained scene recognition model can be obtained again, and the updating of the trained scene recognition model in the electronic device is completed.
Therefore, the method is beneficial to improving the accuracy of identifying the holographic scene type and improving the space sound effect under holographic audio.
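The on-device parameter refresh described above might be sketched as below. Representing the model's parameters as a flat name-to-value mapping is an assumption for illustration; real models would replace tensors inside a loaded network.

```python
class SceneModel:
    """Minimal stand-in for the locally stored trained scene recognition model."""

    def __init__(self, params):
        self.params = dict(params)

    def update(self, new_params):
        """Swap in parameters fetched from the cloud server, keeping any
        parameter the cloud did not resend."""
        self.params.update(new_params)

model = SceneModel({"w": [0.1, 0.2], "b": 0.0})
model.update({"b": 0.5})  # new parameters received from the cloud server
```

After `update`, recognition proceeds with the refreshed parameters, completing the update of the trained scene recognition model in the electronic device.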
To extract the features of the audio stream, in an alternative embodiment, S102 may include:
and extracting the mel-frequency cepstrum coefficient characteristics of the audio stream to obtain the characteristics of the audio stream.
It can be appreciated that, before feature extraction, processing such as data cleaning, file format alignment, data enhancement, and spectrogram conversion is mainly used to obtain the processed audio stream (the processing is not limited to these steps); feature extraction is then performed on the processed audio stream by an algorithm such as Fbank or MFCC to obtain the features of the audio stream.
Thus, after the characteristics of the audio stream are obtained, the more accurate holographic scene type of the audio stream can be obtained, and the accuracy of holographic scene identification is improved.
The embodiment of the application also provides an audio stream identification method, which is applied to a cloud server, and fig. 2 is a schematic flow diagram II of an alternative audio stream identification method provided by the embodiment of the application, as shown in fig. 2, the audio stream identification method may include:
s201, acquiring a sample data set;
s202, training a scene recognition model by using a sample data set to obtain a trained scene recognition model;
And S203, transmitting the trained scene recognition model to the electronic equipment.
In order for the electronic device to obtain a trained scene recognition model, in S201 the cloud server first acquires a sample data set, which may include collected audio streams and the holographic scene types of those audio streams. The collected audio streams may come from one application program or from two or more application programs; the embodiments of the present application are not specifically limited here.
After acquiring the sample data set, in S202 the cloud server trains the pre-stored scene recognition model with it to obtain a trained scene recognition model. To enable recognition of the holographic scene type on the electronic device, in S203 the cloud server sends the trained scene recognition model to the electronic device, so that the electronic device can identify the holographic scene type of an audio stream.
In this way, the energy consumption of the electronic device can be reduced while the accuracy of holographic scene type recognition is improved.
In order for the cloud server to implement optimization of the trained scene recognition model in the electronic device, in an alternative embodiment, the method may further include:
Acquiring characteristics of an audio stream and a holographic scene type of the audio stream from electronic equipment;
Training the locally trained scene recognition model by utilizing the characteristics of the acquired audio stream and the acquired holographic scene type of the audio stream to obtain new parameters of the locally trained scene recognition model;
the new parameters are sent to the electronic device.
It can be understood that the cloud server can acquire, from the electronic device, the features of an audio stream and its identified holographic scene type, where the holographic scene type is one that the trained scene recognition model output identically for a number of consecutive times reaching the preset count. In other words, the cloud server acquires accurate data from the electronic device, which can be used to retrain its locally trained scene recognition model and obtain new parameters for the trained model.
The cloud server sends the new parameters to the electronic equipment, so that the electronic equipment can update the parameters of the scene recognition model after the local training to the new parameters, and the electronic equipment can optimize the scene recognition model after the local training.
Therefore, the accuracy of holographic scene recognition can be further improved by continuously optimizing the trained scene recognition model in the electronic equipment, so that the spatial sound effect under holographic audio is improved.
In addition, in order to improve accuracy of holographic scene recognition without affecting holographic scene type recognition, in an alternative embodiment, training a locally trained scene recognition model by using the features of the acquired audio stream and the acquired holographic scene type of the audio stream to obtain new parameters of the locally trained scene recognition model may include:
When the current moment reaches a preset time range, training the locally trained scene recognition model by utilizing the characteristics of the acquired audio stream and the acquired holographic scene type of the audio stream to obtain new parameters of the locally trained scene recognition model.
It can be understood that only when the current time reaches the preset time range does the cloud server train the locally trained scene recognition model using the characteristics of the acquired audio stream and the holographic scene type of the acquired audio stream to obtain new parameters of the locally trained scene recognition model.
That is, the cloud server trains the locally trained scene recognition model only for a certain specific period of time to obtain new parameters, which are then used to optimize the trained scene recognition model in the electronic device.
The preset time range may be any time period within a day; usually, a time period at night may be selected for retraining the locally trained scene recognition model, so as to prevent too many resources of the cloud server from being occupied during the day.
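A minimal sketch of the preset-time-range gate might look as follows; the 01:00 to 05:00 night window and the function name are assumptions, since the patent only says a night-time period may be chosen.

```python
from datetime import datetime, time

def in_retraining_window(now=None, start=time(1, 0), end=time(5, 0)):
    """Return True only inside the configured retraining window, so the
    cloud server retrains at night rather than during the day."""
    now = (now or datetime.now()).time()
    if start <= end:
        return start <= now <= end
    # Window crossing midnight, e.g. 22:00 to 06:00.
    return now >= start or now <= end
```

The cloud server would check this gate before starting the retraining round and otherwise defer the uploaded data to the next window.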
The following describes, by way of example, a method of identifying audio streams in one or more of the embodiments described above.
Fig. 3 is a flowchart of Example 1 of an alternative audio stream identification method provided in an embodiment of the present application. As shown in Fig. 3, the audio stream identification method may include:
S301, the mobile terminal acquires an audio stream of an application program;
S302, the mobile terminal performs feature extraction on the audio stream to obtain features of the audio stream;
S303, the mobile terminal inputs the characteristics of the audio stream into a trained artificial intelligence (AI) model to obtain a holographic scene tag of the audio stream;
S304, the mobile terminal outputs the holographic scene tag.
Specifically, an application program in the mobile terminal writes audio stream data to the mobile terminal through the AudioTrack interface, and the mobile terminal performs feature extraction on this data: it first performs data cleaning, such as file format alignment, data enhancement and spectrogram conversion, and then extracts features through algorithms such as Fbank or MFCC; the extracted features are fed into the trained AI model (equivalent to the trained scene recognition model), which then outputs the holographic scene tag.
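As a rough illustration of the Fbank-style feature extraction named above, the sketch below computes log-mel filterbank features with plain NumPy; the frame length, hop size, FFT size and filter count are typical values chosen for illustration, not parameters specified in the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank_features(signal, sample_rate=16000, frame_len=400, hop=160,
                   n_fft=512, n_mels=40):
    """Log-mel filterbank (Fbank) features: frame -> window -> |FFT|^2 -> mel -> log."""
    # Split the waveform into overlapping frames.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log-compress; a small epsilon avoids log(0).
    return np.log(power @ fbank.T + 1e-10)
```

MFCC features would add a discrete cosine transform over these log-mel outputs; the patent mentions both variants without fixing one.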
Fig. 4 is a flowchart of Example 2 of an alternative audio stream identification method according to an embodiment of the present application. As shown in Fig. 4, the audio stream identification method may include:
S401, the mobile terminal outputs a holographic scene tag using the trained AI model;
S402, the mobile terminal judges whether the holographic scene tag has changed, and if yes, executes S403;
S403, the mobile terminal stores the features and the holographic scene tag;
S404, the mobile terminal waits to upload the recent data to a cloud server at night;
S405, the cloud server receives the data, retrains the trained AI model with the data, and obtains new parameters of the trained AI model;
and S406, the cloud server periodically issues new parameters to the mobile terminal.
The learning process of the AI model is divided into two parts: model training and night-time self-learning. Model training is not performed on the mobile terminal but on a cloud server: the cloud server trains the AI model with the acquired sample data set, and the trained AI model is then preloaded onto the mobile terminal.
For night-time self-learning, the mobile terminal outputs a holographic scene tag using the trained AI model, and then judges whether the number of times the same tag has been output consecutively for the identified audio stream reaches a preset number. If so, it waits until night to upload the recent data, namely the features of the audio stream and the holographic scene type of the audio stream, to the cloud server; after receiving the data, the cloud server retrains the trained AI model with it, obtains new parameters of the trained AI model, and sends the new parameters to the mobile terminal to update the parameters of the locally trained AI model.
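The consecutive-same-tag condition described above can be sketched as a small buffer on the terminal side; the class name `SelfLearningBuffer`, the trigger-once behavior, and the default preset count are illustrative assumptions rather than details from the patent.

```python
class SelfLearningBuffer:
    """Keep (features, tag) pairs only when the model has emitted the same
    holographic scene tag a preset number of consecutive times."""

    def __init__(self, preset_times=5):
        self.preset_times = preset_times
        self.last_tag = None
        self.streak = 0
        self.pending_upload = []  # data waiting for the night-time upload

    def on_inference(self, features, tag):
        # Track how many consecutive inferences produced the same tag.
        if tag == self.last_tag:
            self.streak += 1
        else:
            self.last_tag, self.streak = tag, 1
        if self.streak == self.preset_times:
            # Confident sample: store it for the cloud retraining round.
            self.pending_upload.append((features, tag))

    def drain_for_upload(self):
        # Hand everything to the uploader and clear the local buffer.
        data, self.pending_upload = self.pending_upload, []
        return data
```

At night, `drain_for_upload()` would feed the upload step, after which the cloud server performs the retraining described above.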
This example proposes a neural network-based model that can infer the type of an audio stream in a mobile terminal from its content, such as music, video, navigation, alarm clock, notification, incoming call, etc.
Based on this embodiment, the mobile terminal can asynchronously judge the sound source type of the audio being played, which solves the problem of applications mislabeling the audio stream type, supports the holographic audio experience in all music-playing APPs, and provides great convenience and room for customized experiences for third-party developers and holographic audio enthusiasts.
This embodiment is applied to the holographic audio module of a mobile terminal. Holographic scene recognition is performed on part of the data content of an input audio source using an audio classification technique to obtain the holographic scene tag of the audio source; a holographic algorithm then performs sound-image control according to the tag and its corresponding priority, so that each audio stream is positioned according to the scene. When the holographic scene tag is obtained, the features of the audio stream and its tag are stored asynchronously, and the AI model is retrained while the mobile phone is charging at night, so that the AI model self-learns the user's usage-habit scenes and the accuracy of holographic scene type recognition is improved.
For example, when a user watches a live broadcast while using a music APP, after the application writes the audio stream data, the audio classification technique periodically samples the content of the audio stream and extracts features, then obtains scene tags through scene recognition; a later holographic algorithm distributes sound-image positions according to the tags. After the audio stream finishes playing, the useful part of its features is stored and periodically cleaned, and self-learning of the AI model is performed at night when the user is charging the mobile phone and it is idle.
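The priority-driven sound-image distribution described here can be sketched as follows; the priority table and azimuth list are invented for illustration, as the patent does not enumerate concrete priorities or positions.

```python
# Hypothetical priority table: lower number means higher priority.
PRIORITY = {"incoming_call": 0, "navigation": 1, "alarm": 2,
            "notification": 3, "music": 4, "video": 4}

# Candidate azimuths in degrees; the front position (0) goes to the
# highest-priority stream.
AZIMUTHS = [0, -45, 45, -90, 90, 180]

def assign_positions(stream_tags):
    """Map each concurrent audio stream's holographic scene tag to an
    azimuth by priority, so higher-priority streams sit closer to the front."""
    order = sorted(range(len(stream_tags)),
                   key=lambda i: PRIORITY.get(stream_tags[i], 99))
    positions = [None] * len(stream_tags)
    for rank, i in enumerate(order):
        positions[i] = AZIMUTHS[rank % len(AZIMUTHS)]
    return positions
```

For instance, with a music stream and a navigation stream playing together, navigation (higher priority) would take the front position and music a side position.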
An embodiment of the present application provides an audio stream identification method applied to an electronic device. The method includes: acquiring an audio stream of the electronic device; performing feature extraction on the audio stream to obtain characteristics of the audio stream; and inputting the characteristics of the audio stream into a trained scene recognition model for holographic scene recognition to obtain the holographic scene type of the audio stream. That is, in the embodiment of the present application, the holographic scene type of the audio stream can be obtained by inputting the audio stream of the electronic device into the trained scene recognition model for holographic scene recognition. Because the holographic scene type of the audio stream is identified by the trained scene recognition model, the determined holographic scene type is better suited to holographic audio, the accuracy of determining the holographic scene type under holographic audio is improved, and the sound effect of holographic audio is further improved.
Based on the same inventive concept as the foregoing embodiments, an embodiment of the present application provides an audio stream recognition device, where the device is disposed in an electronic device, fig. 5 is a schematic structural diagram of an alternative audio stream recognition device provided in the embodiment of the present application, and as shown in fig. 5, the audio stream recognition device may include:
A first obtaining module 51, configured to obtain an audio stream of the electronic device;
an extracting module 52, configured to perform feature extraction on the audio stream to obtain features of the audio stream;
The recognition module 53 is configured to input the features of the audio stream into the trained scene recognition model to perform holographic scene recognition, so as to obtain a holographic scene type of the audio stream.
In an alternative embodiment, when the number of audio streams is at least two, the extraction module 52 is specifically configured to:
Extracting the characteristics of each audio stream in the audio streams to obtain the characteristics of each audio stream;
Accordingly, the identification module 53 is specifically configured to:
and inputting the characteristics of each audio stream into the trained scene recognition model to respectively carry out holographic scene recognition to obtain the holographic scene type of each audio stream.
In an alternative embodiment, the apparatus is further adapted to:
Based on the holographic scene type of the audio stream, playing the spatial audio corresponding to the audio stream.
In an alternative embodiment, when the number of audio streams is one, the apparatus plays the spatial audio corresponding to the audio stream based on the holographic scene type of the audio stream, and may include:
Determining a spatial position corresponding to the holographic scene type based on the holographic scene type of the audio stream;
And playing the spatial audio of the audio stream based on the spatial position corresponding to the holographic scene type.
In an alternative embodiment, when the number of the audio streams is at least two, the apparatus plays the spatial audio corresponding to the audio stream based on the holographic scene type of the audio stream, and may include:
Determining a spatial position corresponding to each audio stream based on the priority of the holographic scene type of each audio stream in the audio streams;
and playing the spatial audio of each audio stream based on the corresponding spatial position of each audio stream.
In an alternative embodiment, the apparatus is further adapted to:
when the holographic scene types of the audio stream continuously output by the trained scene recognition model are the same and the number of continuous outputs reaches the preset number of times, the characteristics of the audio stream and the holographic scene type of the audio stream are sent to a cloud server;
the characteristics of the audio stream and the holographic scene type of the audio stream are used by the cloud server to train the locally trained scene recognition model, so as to obtain new parameters of the trained scene recognition model and update the locally trained scene recognition model.
In an alternative embodiment, the apparatus is further adapted to:
and acquiring the trained scene recognition model from the cloud server.
In an alternative embodiment, the apparatus is further adapted to:
Acquiring new parameters of the trained scene recognition model from a cloud server;
and updating the parameters of the trained scene recognition model to the new parameters, so as to obtain an updated trained scene recognition model.
In an alternative embodiment, the extraction module 52 is specifically configured to:
and extracting the mel-frequency cepstrum coefficient characteristics of the audio stream to obtain the characteristics of the audio stream.
In practical applications, the first obtaining module 51, the extracting module 52 and the identifying module 53 may be implemented by a processor located on the audio stream recognition device, specifically a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or the like.
An embodiment of the present application provides an audio stream recognition device, where the device is disposed in an electronic device, fig. 6 is a schematic structural diagram two of an alternative audio stream recognition device provided by the embodiment of the present application, and as shown in fig. 6, the audio stream recognition device may include:
A second acquisition module 61, configured to acquire a sample data set, where the sample data set includes an acquired audio stream and a holographic scene type of the acquired audio stream;
The training module 62 is configured to train the scene recognition model by using the sample data set, so as to obtain a trained scene recognition model;
and the sending module 63 is configured to send the trained scene recognition model to the electronic device.
In an alternative embodiment, the apparatus is further adapted to:
acquiring the characteristics of an audio stream and the holographic scene type of the audio stream from the electronic device, wherein the holographic scene type of the audio stream is a type continuously output by the trained scene recognition model, the number of continuous outputs reaching the preset number of times;
Training the locally trained scene recognition model by utilizing the characteristics of the acquired audio stream and the acquired holographic scene type of the audio stream to obtain new parameters of the locally trained scene recognition model;
the new parameters are sent to the electronic device.
In an alternative embodiment, the apparatus, in training the locally trained scene recognition model by using the characteristics of the acquired audio stream and the holographic scene type of the acquired audio stream to obtain new parameters of the locally trained scene recognition model, is configured to:
When the current moment reaches a preset time range, training the locally trained scene recognition model by utilizing the characteristics of the acquired audio stream and the acquired holographic scene type of the audio stream to obtain new parameters of the locally trained scene recognition model.
In practical applications, the second acquiring module 61, the training module 62 and the sending module 63 may be implemented by a processor located on the audio stream recognition device, specifically a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or the like.
Fig. 7 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application, and as shown in fig. 7, an embodiment of the present application provides an electronic device 700, including:
A processor 71 and a storage medium 72 storing instructions executable by the processor 71, the storage medium 72 performing operations in dependence on the processor 71 through a communication bus 73, the instructions, when executed by the processor 71, performing the method of identifying the audio stream as performed in one or more of the embodiments described above.
In practical use, the components of the electronic device 700 are coupled together via the communication bus 73. It can be understood that the communication bus 73 is used to enable connection and communication between these components. In addition to the data bus, the communication bus 73 includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are labeled as the communication bus 73 in Fig. 7.
Fig. 8 is a schematic structural diagram of an optional cloud server according to an embodiment of the present application, and as shown in fig. 8, an embodiment of the present application provides a cloud server 800, including:
a processor 81 and a storage medium 82 storing instructions executable by the processor 81, the storage medium 82 performing operations in dependence on the processor 81 via a communication bus 83, the instructions, when executed by the processor 81, performing the method of identifying the audio stream as performed in one or more embodiments described above.
In practical application, the components in the cloud server 800 are coupled together via the communication bus 83. It can be understood that the communication bus 83 is used to enable connection and communication between these components. In addition to the data bus, the communication bus 83 includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are labeled as the communication bus 83 in Fig. 8.
Embodiments of the present application provide a computer storage medium storing executable instructions that, when executed by one or more processors, perform the audio stream identification method described in one or more of the embodiments above.
The computer readable storage medium may be a ferromagnetic random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, a compact disc read-only memory (CD-ROM), or the like.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application.

Claims (17)

1. A method of identifying an audio stream, the method being applied to an electronic device and comprising:
acquiring an audio stream of the electronic equipment;
extracting the characteristics of the audio stream to obtain the characteristics of the audio stream;
And inputting the characteristics of the audio stream into the trained scene recognition model to perform holographic scene recognition, so as to obtain the holographic scene type of the audio stream.
2. The method according to claim 1, wherein when the number of audio streams is at least two, the performing feature extraction on the audio streams to obtain features of the audio streams includes:
extracting the characteristics of each audio stream in the audio streams to obtain the characteristics of each audio stream;
Correspondingly, the inputting the characteristics of the audio stream into the trained scene recognition model for carrying out holographic scene recognition to obtain the holographic scene type of the audio stream comprises the following steps:
and inputting the characteristics of each audio stream into the trained scene recognition model to respectively carry out holographic scene recognition to obtain the holographic scene type of each audio stream.
3. The method according to claim 1, wherein the method further comprises:
and playing the space audio corresponding to the audio stream based on the holographic scene type of the audio stream.
4. The method of claim 3, wherein playing the spatial audio corresponding to the audio stream based on the holographic scene type of the audio stream when the number of the audio streams is one, comprises:
determining a spatial position corresponding to a holographic scene type based on the holographic scene type of the audio stream;
and playing the spatial audio of the audio stream based on the spatial position corresponding to the holographic scene type.
5. The method of claim 3, wherein when the number of audio streams is at least two, the playing the spatial audio corresponding to the audio stream based on the holographic scene type of the audio stream comprises:
determining a corresponding spatial position of each audio stream based on the priority of the holographic scene type of each audio stream in the audio streams;
and playing the spatial audio of each audio stream based on the corresponding spatial position of each audio stream.
6. The method according to claim 1, wherein the method further comprises:
When the holographic scene types of the audio stream continuously output by the trained scene recognition model are the same and the number of continuous outputs reaches the preset number of times, the characteristics of the audio stream and the holographic scene type of the audio stream are sent to a cloud server;
The characteristics of the audio stream and the holographic scene type of the audio stream are used for the cloud server to train the locally trained scene recognition model, so that new parameters of the trained scene recognition model are obtained, and the locally trained scene recognition model is updated.
7. The method according to any one of claims 1 to 6, further comprising:
and acquiring the trained scene recognition model from the cloud server.
8. The method of claim 7, wherein the method further comprises:
Acquiring new parameters of the trained scene recognition model from a cloud server;
And updating the parameters of the trained scene recognition model into the new parameters so as to obtain an updated trained scene recognition model.
9. The method according to any one of claims 1 to 6, wherein the performing feature extraction on the audio stream to obtain the features of the audio stream includes:
and extracting the mel-frequency cepstrum coefficient characteristics of the audio stream to obtain the characteristics of the audio stream.
10. An audio stream identification method, wherein the method is applied to a cloud server and comprises the following steps:
Acquiring a sample data set, wherein the sample data set comprises an acquired audio stream and a holographic scene type of the acquired audio stream;
training the scene recognition model by using the sample data set to obtain a trained scene recognition model;
and sending the trained scene recognition model to the electronic equipment.
11. The method according to claim 10, wherein the method further comprises:
Acquiring the characteristics of an audio stream and the holographic scene type of the audio stream from the electronic device, wherein the holographic scene type of the audio stream is a type continuously output by a trained scene recognition model, the number of continuous outputs reaching a preset number of times;
Training the locally trained scene recognition model by utilizing the characteristics of the acquired audio stream and the holographic scene type of the acquired audio stream to obtain new parameters of the locally trained scene recognition model;
And sending the new parameters to the electronic equipment.
12. The method of claim 11, wherein training the locally trained scene recognition model using the characteristics of the acquired audio stream and the holographic scene type of the acquired audio stream to obtain new parameters of the trained scene recognition model comprises:
When the current moment reaches a preset time range, training the locally trained scene recognition model by utilizing the characteristics of the acquired audio stream and the acquired holographic scene type of the audio stream to obtain new parameters of the locally trained scene recognition model.
13. An apparatus for recognizing an audio stream, the apparatus being provided in an electronic device, comprising:
the first acquisition module is used for acquiring the audio stream of the electronic equipment;
The extraction module is used for extracting the characteristics of the audio stream to obtain the characteristics of the audio stream;
and the identification module is used for inputting the characteristics of the audio stream into the trained scene identification model to carry out holographic scene identification, so as to obtain the holographic scene type of the audio stream.
14. An audio stream recognition device, wherein the device is arranged on a cloud server, and comprises:
The system comprises a first acquisition module, a second acquisition module and a control module, wherein the first acquisition module is used for acquiring a sample data set, and the sample data set comprises an acquired audio stream and a holographic scene type of the acquired audio stream;
the training module is used for training the scene recognition model by using the sample data set to obtain a trained scene recognition model;
And the sending module is used for sending the trained scene recognition model to the electronic equipment.
15. An electronic device, comprising:
A processor and a storage medium storing instructions executable by the processor, the storage medium performing operations in dependence on the processor through a communication bus, the instructions, when executed by the processor, performing the method of identifying an audio stream as claimed in any one of claims 1 to 9.
16. A cloud server, comprising:
A processor and a storage medium storing instructions executable by the processor, the storage medium performing operations in dependence on the processor through a communication bus, the instructions, when executed by the processor, performing the method of identifying an audio stream as claimed in any one of claims 10 to 12.
17. A computer storage medium storing executable instructions which, when executed by one or more processors, perform the method of identifying an audio stream according to any one of claims 1 to 9 or perform the method of identifying an audio stream according to any one of claims 10 to 12.
CN202311452191.0A 2023-11-02 2023-11-02 Audio stream identification method, device, electronic device and computer storage medium Pending CN119943091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311452191.0A CN119943091A (en) 2023-11-02 2023-11-02 Audio stream identification method, device, electronic device and computer storage medium


Publications (1)

Publication Number Publication Date
CN119943091A true CN119943091A (en) 2025-05-06

Family

ID=95531804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311452191.0A Pending CN119943091A (en) 2023-11-02 2023-11-02 Audio stream identification method, device, electronic device and computer storage medium

Country Status (1)

Country Link
CN (1) CN119943091A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095387A (en) * 2016-06-16 2016-11-09 广东欧珀移动通信有限公司 A terminal sound effect setting method and terminal
CN108141696A (en) * 2016-03-03 2018-06-08 谷歌有限责任公司 System and method for spatial audio conditioning
CN108764304A (en) * 2018-05-11 2018-11-06 Oppo广东移动通信有限公司 scene recognition method, device, storage medium and electronic equipment
CN109218528A (en) * 2018-09-04 2019-01-15 Oppo广东移动通信有限公司 Sound effect treatment method, device and electronic equipment
CN109616142A (en) * 2013-03-26 2019-04-12 杜比实验室特许公司 Apparatus and method for audio classification and processing
CN111477250A (en) * 2020-04-07 2020-07-31 北京达佳互联信息技术有限公司 Audio scene recognition method, and training method and device of audio scene recognition model
CN111641863A (en) * 2019-03-01 2020-09-08 深圳Tcl新技术有限公司 Method, system and device for playing control of surround sound and storage medium
CN113205802A (en) * 2021-05-10 2021-08-03 芜湖美的厨卫电器制造有限公司 Updating method of voice recognition model, household appliance and server



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination