CN113823325B - Audio rhythm detection method, device, equipment and medium - Google Patents
Audio rhythm detection method, device, equipment and medium
- Publication number: CN113823325B
- Application number: CN202110622026.XA
- Authority: CN (China)
- Prior art keywords: audio, segment, STFT, rhythm, spectrum
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02B—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
- Y02B20/00—Energy efficient lighting technologies, e.g. halogen lamps or gas discharge lamps
- Y02B20/40—Control techniques providing energy savings, e.g. smart controller or presence detection
Abstract
The application discloses an audio rhythm detection method, device, equipment and medium. The method comprises the following steps: after an audio signal to be detected is obtained, feature processing is performed on the audio signal based on its audio characteristics to obtain at least one audio segment, and the rhythm type of each audio segment is then determined according to the rhythm information of that segment. Automatic, efficient and accurate music rhythm detection is thereby realized: rhythms no longer need to be labeled manually, the cost of music rhythm detection is greatly reduced, and both the efficiency and the accuracy of music editing are improved.
Description
Technical Field
The present disclosure relates generally to the field of data processing, in particular to audio data processing, and more particularly to a method, apparatus, device, and medium for detecting an audio tempo.
Background
With the growing popularity of audio and video editing, the demand for clipping both pure audio and the audio within videos keeps increasing. In the related art, a professional is usually required to listen to the audio to be clipped repeatedly and then divide it by rhythm according to his or her own judgment; this consumes a great deal of manpower, and the accuracy is difficult to keep consistent across different people.
Disclosure of Invention
In view of the foregoing drawbacks or shortcomings of the prior art, it is desirable to provide an audio tempo detection method, apparatus, device and medium that enable automated, efficient and accurate music tempo detection.
In a first aspect, an embodiment of the present application provides a method for detecting an audio tempo, where the method includes:
Acquiring an audio signal to be detected;
Performing feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, wherein the rhythms of two adjacent audio segments are different;
And acquiring rhythm information of each audio segment, and determining the rhythm type of each audio segment based on the rhythm information.
In a second aspect, an embodiment of the present application provides an audio tempo detection apparatus, including:
The acquisition module is used for acquiring the audio signal to be detected;
The segmentation module is used for carrying out feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, and the rhythms of two adjacent audio segments are different;
And the determining module is used for acquiring the rhythm information of each audio segment and determining the rhythm type of the audio segment based on the rhythm information.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements a method as described in the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in embodiments of the present application.
After the audio signal to be detected is obtained, feature processing is performed on the audio signal based on its audio characteristics to obtain at least one audio segment, and the rhythm type of each audio segment is then determined according to the rhythm information of each audio segment. Automatic, efficient and accurate music rhythm detection is thereby realized: rhythms no longer need to be labeled manually, the cost of music rhythm detection is greatly reduced, and both the efficiency and the accuracy of music editing are improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
fig. 1 is an implementation environment architecture diagram of an audio tempo detection method according to an embodiment of the present application;
fig. 2 is a flowchart of an audio rhythm detection method according to an embodiment of the present application;
fig. 3 is a flowchart of another audio tempo detection method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of STFT feature extraction on an audio signal according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an audio STFT energy spectrum according to an embodiment of the present application;
Fig. 6 is a flowchart of another audio tempo detection method according to an embodiment of the present application;
fig. 7 is a schematic diagram of audio rhythm detection according to an embodiment of the present application;
FIG. 8 is a schematic diagram of audio rhythm determination for inter-segment detection according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an audio tempo determination for intra-segment detection according to an embodiment of the present application;
fig. 10 is a flowchart of another audio tempo detection method according to an embodiment of the present application;
fig. 11 is a schematic diagram of a method for detecting an inter-segment audio rhythm according to an embodiment of the present application;
Fig. 12 is a schematic diagram of an intra-segment audio rhythm detection method according to an embodiment of the present application;
Fig. 13 is a schematic structural diagram of an audio rhythm detection device according to an embodiment of the present application;
Fig. 14 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
Artificial intelligence (AI) comprises the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to recognize, track and measure targets, and further processes the results graphically into images better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and voice is expected to become one of the most important modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph techniques.
Machine learning (ML) is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Automatic driving technology generally comprises high-precision maps, environment perception, behavior decision, path planning, motion control and other technologies, and has broad application prospects.
With the research and advancement of artificial intelligence technology, it is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care and smart customer service. It is believed that with the development of technology, artificial intelligence will be applied in ever more fields and play an increasingly important role.
The scheme provided by the embodiments of the present application relates to artificial intelligence technologies such as machine learning, and is specifically described through the following embodiments.
For a clearer description of the present application, the following is an explanation of related art terms:
Short-time Fourier transform (STFT): a mathematical transform related to the Fourier transform, used to determine the frequency and phase of local sections of a time-varying signal; it is often used as a basic audio feature.
Zero crossing rate (ZCR): the number of times the audio signal crosses zero (its sample amplitude changes from positive to negative or from negative to positive) within each frame.
Chroma features (Chroma): tonal features, also called chroma vectors. A chroma vector has 12 elements, each representing the energy of one of the 12 pitch classes within a frame; the energy of the same pitch class across different octaves is accumulated.
Mel spectrogram (Mel-spectrogram): an audio feature obtained by mapping the STFT spectrogram onto the mel frequency scale, which better matches the hearing characteristics of the human ear.
K-means clustering: an iteratively solved cluster analysis algorithm. The data are pre-divided into K groups, K objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to the cluster center nearest to it.
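For illustration only, the following is a minimal sketch of how these frame-level features can be computed; Python with the librosa library is assumed here, as the application itself does not name any specific tool:

```python
# Sketch under stated assumptions: librosa is one common choice for
# frame-level audio features; the application does not prescribe a library.
import librosa

y, sr = librosa.load("song.wav", sr=None)            # time-domain audio signal

stft = librosa.stft(y, n_fft=2048, hop_length=512)   # complex STFT, one column per frame
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # (12, frames): energy per pitch class
mel = librosa.feature.melspectrogram(y=y, sr=sr)     # mel spectrogram
```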
With the rapid development of the internet, the demand for editing audio and video keeps growing, as do the requirements placed on the edits. In the related art, manual labeling is generally performed based on the hearing of professionals. Specifically, a large number of professional annotators are hired; they listen to the rhythm changes of the music at segment-level granularity, manually annotate the dividing lines of the rhythm-change segments according to what they hear, and judge the rhythm speed of each segment by ear.
However, because this labeling depends entirely on human hearing, labeling efficiency is extremely low. A song usually lasts 3 to 5 minutes, and an annotator must listen to it in full at least once before labeling; when higher labeling precision is required, annotators often have to listen to a song several more times to determine the rhythm dividing lines.
Moreover, different annotators hear the same song differently, so different people may produce different rhythm detection results for the same song, which seriously affects labeling accuracy.
Based on the above, the application provides an audio rhythm detection method, device, equipment and medium.
Fig. 1 is an implementation environment architecture diagram of an audio tempo detection method according to an embodiment of the present application. As shown in fig. 1, the implementation environment architecture includes: a terminal 100 and a server 200.
The audio tempo detection device may be the terminal 100 or the server 200. The terminal 100 or the server 200 acquires audio data to be annotated or video data containing the audio data to be annotated.
The process of audio tempo detection may be performed at the terminal 100 or at the server 200. The terminal 100 acquires the video or audio to be detected. When the detection is performed by the terminal 100, the terminal 100 directly performs rhythm detection on the video or audio after acquiring it and obtains the rhythm detection result. When the detection is performed by the server 200, the terminal 100 sends the video or audio to be detected to the server 200; the server 200 receives it, performs rhythm detection on it, and obtains the rhythm detection result.
Optionally, the video or audio to be detected may be pure audio or may be video containing audio.
In addition, the terminal 100 may display an application interface through which video or audio to be detected uploaded by the user may be acquired or transmitted to the server 200.
The types of the terminal 100 include, but are not limited to, a smart phone, a tablet computer, a television, a notebook computer, a desktop computer, etc., which are not particularly limited in the embodiment of the present application.
Wherein, when the process of audio tempo detection is performed at the server 200, the server 200 transmits a tempo detection result to the terminal 100, which is displayed on the application interface by the terminal 100. Further, the server 200 may be a server, a server cluster formed by a plurality of servers, or a cloud computing service center.
The terminal 100 establishes a communication connection with the server 200 through a wired or wireless network.
Fig. 2 is a flowchart of an audio tempo detection method according to an embodiment of the present application. It should be noted that the execution body of the audio rhythm detection method in this embodiment is an audio rhythm detection apparatus, which may be implemented in software and/or hardware. The apparatus may be configured in an electronic device, or in a server that controls the electronic device; the server communicates with the electronic device to control it.
The electronic device in this embodiment may include, but is not limited to, a personal computer, a tablet computer, a smart phone, a smart speaker, and the like; this embodiment places no particular limitation on the electronic device.
As shown in fig. 2, the audio rhythm detection method provided by the embodiment of the application includes the following steps:
step 101, an audio signal to be detected is acquired.
The audio signal is a carrier of the frequency and amplitude variation information of regular sound waves carrying voice, music and sound effects. According to the characteristics of the sound waves, audio may be classified into regular audio and irregular sound. Regular audio is a continuously varying analog signal that can be represented by a continuous curve called a sound wave. The three elements of sound are pitch, intensity and timbre. A sound wave or sine wave has three important parameters: frequency, amplitude and phase. The audio signal here is the time-domain signal of the audio.
In the embodiment of the application, the audio signal can be a single musical piece or background music configured in video. Further, the audio signal may be selected to have different lengths according to different detection requirements, for example, when the inter-segment rhythm detection is performed, the audio signal may be a length corresponding to an entire song or a length corresponding to an entire video; in the intra-segment tempo detection, the audio signal may be a length corresponding to a certain segment of the entire song, or a length corresponding to the entire video capturing segment, or a length of an independent short video, or the like, and the length of the audio signal is not limited herein.
Step 102, performing feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, wherein the rhythms of two adjacent audio segments are different.
Where the audio characteristics are characteristic rules of the audio itself, such as the audio having loudness, frequency, amplitude, etc.
It should be noted that the application performs feature processing on the audio signal according to its audio characteristics, and then divides the audio signal into a plurality of audio segments according to the speed changes of the audio rhythm: each position where a rhythm change occurs serves as a dividing line between audio segments, so the audio signal is cut at every rhythm change and the rhythms of two adjacent audio segments are therefore different.
Step 103, obtaining the rhythm information of each audio segment, and determining the rhythm type of the audio segment based on the rhythm information.
That is, after segmenting the audio signal, the application further obtains the rhythm information within each audio segment and determines the rhythm type of the segment according to that information. The rhythm types include relaxed and compact.
Therefore, after the audio signal to be detected is obtained, feature processing is performed on it based on its audio characteristics to obtain at least one audio segment, and the rhythm type of each audio segment is then determined according to the rhythm information of each segment. Automatic, efficient and accurate music rhythm detection is thereby realized: rhythms no longer need to be labeled manually, the cost of music rhythm detection is greatly reduced, and both the efficiency and the accuracy of music editing are improved.
In one or more embodiments, as shown in fig. 3, step 102, performing feature processing on an audio signal based on audio characteristics of the audio signal to obtain at least one audio segment, includes:
In step 1021, feature extraction is performed on the audio signal based on the audio characteristics to obtain audio feature data.
The audio characteristics include STFT characteristics, mel characteristics and zero crossing rate.
In one or more embodiments, performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data includes: performing feature extraction on the audio signal based on the STFT characteristic, the mel characteristic and the zero crossing rate respectively, to obtain the STFT spectrum, the mel spectrum and the zero crossing information corresponding to the audio signal; performing feature splitting on the STFT spectrum to obtain an STFT energy spectrum, an STFT amplitude spectrum, an STFT phase spectrum and a chroma spectrum; and performing frequency screening on the STFT energy spectrum, the STFT amplitude spectrum and the mel spectrum to obtain an STFT high-frequency energy spectrum, an STFT high-frequency amplitude spectrum and a mel high-frequency spectrum.
When feature extraction and splitting are performed according to the STFT characteristic, as shown in fig. 4, the audio signal may first be divided into frames to obtain audio frame information corresponding to the timing information; the frames are then windowed and Fourier-transformed to obtain an STFT amplitude spectrum and an STFT phase spectrum. Further, an STFT energy spectrum and a chroma spectrum can be derived from the energy characteristics of the STFT amplitude spectrum.
Further, by observing the STFT energy spectrum and the mel spectrum, it was found that the feature differences between different rhythm segments are much smaller at low frequencies than at high frequencies, as shown in fig. 5. The application therefore applies high-frequency filtering to the obtained energy features (the STFT energy spectrum, the STFT amplitude spectrum and the mel spectrum): the low-frequency bands that provide no discrimination are filtered out and only the high-frequency bands are retained, thereby highlighting the feature differences between different rhythm segments.
Optionally, for each feature requiring high-frequency filtering, the application keeps the top third of the feature bins as the filtered high-frequency feature. High-frequency filtering of the STFT energy spectrum yields the STFT high-frequency energy spectrum, high-frequency filtering of the STFT amplitude spectrum yields the STFT high-frequency amplitude spectrum, and high-frequency filtering of the mel spectrum yields the mel high-frequency spectrum. As a possible embodiment, the high-frequency filtering may be performed with a windowed filter, or in any other manner capable of implementing the high-frequency filtering function; the application is not limited in this respect.
Therefore, the application extracts audio features at the frame level. Since frame-level granularity is on the order of milliseconds, millisecond-level audio segmentation can be realized in the subsequent steps, greatly improving segmentation precision.
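A minimal sketch of this splitting and high-frequency screening follows, again assuming Python/librosa; the "top third of the feature bins" rule follows the preferred embodiment described above:

```python
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=None)
D = librosa.stft(y, n_fft=2048, hop_length=512)

amplitude = np.abs(D)                                  # STFT amplitude spectrum
phase = np.angle(D)                                    # STFT phase spectrum
energy = amplitude ** 2                                # STFT energy spectrum
mel = librosa.feature.melspectrogram(S=energy, sr=sr)  # mel spectrum from the energy spectrum

def keep_high_frequency(spec):
    # Keep only the top third of the frequency bins, discarding the
    # low-frequency bands that carry little rhythm discrimination.
    n_bins = spec.shape[0]
    return spec[n_bins - n_bins // 3:, :]

energy_hf = keep_high_frequency(energy)        # STFT high-frequency energy spectrum
amplitude_hf = keep_high_frequency(amplitude)  # STFT high-frequency amplitude spectrum
mel_hf = keep_high_frequency(mel)              # mel high-frequency spectrum
```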
And step 1022, performing feature stitching processing on the audio feature data to obtain target features.
In the embodiment of the application, the audio rhythm detection comprises inter-segment detection and intra-segment detection, wherein the inter-segment detection is a rhythm detection process of dividing a whole song or a whole video into a plurality of audio segments according to rhythms, and the intra-segment detection is a rhythm detection process of performing intra-segment for each divided audio segment.
When inter-segment detection is performed, performing feature stitching on the audio feature data to obtain the target features includes: selecting three groups of feature data from the zero crossing information, the chroma spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, performing intra-group stitching on each group, and taking the three groups of stitched feature data together as the target features, wherein at least two of the groups each contain at least two types of feature data.
That is, when inter-segment rhythm detection is performed, three sets of feature data can be extracted from the plurality of audio feature data, and intra-set splicing is performed on the three sets of feature data respectively, so as to obtain spliced feature data, thereby providing richer frequency domain difference information for subsequent clustering operation. Alternatively, three sets of feature data may be obtained by means of random extraction.
For example, the STFT high-frequency energy spectrum and the zero crossing information may be randomly selected as the first feature group and stitched within the group, the STFT high-frequency amplitude spectrum and the chroma spectrum as the second group, and the mel high-frequency spectrum alone as the third group; in the last case the single feature may be used as-is or stitched with a zero vector. Alternatively, the mel high-frequency spectrum and the zero crossing information may be taken as the first group, the STFT high-frequency amplitude spectrum and the chroma spectrum as the second group, and the STFT high-frequency energy spectrum and the STFT phase spectrum as the third group.
It should be understood that, in the embodiment of the present application, feature data are selected randomly for stitching in order to increase the frequency-domain difference information; therefore the features selected for the three groups must not repeat, and the two features stitched within the same group must not be identical.
When intra-segment detection is performed, performing feature stitching on the audio feature data to obtain the target features includes: selecting two groups of feature data from the zero crossing information, the chroma spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, performing intra-group stitching on each group, and taking the two stitched groups as the target features, wherein each group contains at least two types of feature data.
For example, the STFT high-frequency energy spectrum and the zero crossing information may be randomly selected as the first feature group and stitched within the group, and the STFT high-frequency amplitude spectrum and the chroma spectrum as the second group. Alternatively, the mel high-frequency spectrum and the zero crossing information may be taken as the first group, and the STFT high-frequency amplitude spectrum and the chroma spectrum as the second group.
It should be understood that, here as well, feature data are selected randomly for stitching in order to increase the frequency-domain difference information; therefore the features selected for the two groups must not repeat, and the two features stitched within the same group must not be identical.
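The intra-group stitching reduces to stacking feature matrices frame by frame. A minimal sketch (the group composition below is only one of the admissible random choices, and the feature arrays from the earlier sketches are assumed to share the same frame count):

```python
import numpy as np

# Each feature is shaped (bins, frames); stitching a group stacks its
# features along the bin axis, giving one longer feature vector per frame.
group1 = np.vstack([energy_hf, zcr])        # STFT high-frequency energy + zero crossing info
group2 = np.vstack([amplitude_hf, chroma])  # STFT high-frequency amplitude + chroma spectrum
group3 = mel_hf                             # single-feature third group (inter-segment case)

target_features = [group1, group2, group3]  # inter-segment detection uses three groups
# Intra-segment detection would use only two such groups.
```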
Step 1023, classifying the target features based on a clustering algorithm to obtain a clustering tag sequence corresponding to the audio signal.
Alternatively, the clustering algorithm may be a binary (two-cluster) clustering algorithm, such as the K-means algorithm. Other machine-learning clustering algorithms capable of producing two clusters may also be selected to improve clustering accuracy and reduce the time needed to generate the cluster label sequence; the application is not specifically limited in this respect.
In one or more embodiments, each target feature is classified with the binary clustering algorithm to obtain a feature label sequence corresponding to that target feature; the obtained feature label sequences are combined into a transition label sequence, and the transition label sequence is binarized to obtain the cluster label sequence.
The feature label sequence obtained by the binary clustering algorithm is a 0-1 sequence, where 0 indicates that the rhythm of that frame of the audio signal is below a preset threshold (a slow rhythm) and 1 indicates that it is at or above the preset threshold (a fast rhythm).
Specifically, when inter-segment detection is performed, three target features are obtained from the three feature groups. Each target feature is classified by the binary clustering algorithm, yielding three 0-1 sequences, which are then combined into a transition label sequence: a 0-3 sequence obtained by adding the label values of the three 0-1 sequences at each frame position. The transition label sequence is then binarized, i.e., frame positions with label value 0 keep the value 0 and frame positions with label values 1-3 are set to 1, giving the final cluster label sequence expressed in binary values.
Therefore, by applying K-means binary clustering to multiple frame-level audio features, the application effectively improves the precision of audio segment division to frame-level granularity, which is more accurate than manually labeled rhythm dividing lines.
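A sketch of this clustering and label-merging step, assuming scikit-learn's KMeans. Note that K-means assigns its 0/1 cluster indices arbitrarily, so in practice the polarity of each sequence may need aligning (e.g. by mean feature energy); that detail is an assumption not spelled out in the application:

```python
import numpy as np
from sklearn.cluster import KMeans

def binary_cluster(feature):
    # KMeans expects (n_samples, n_features): one sample per audio frame.
    return KMeans(n_clusters=2, n_init=10).fit_predict(feature.T)

label_seqs = [binary_cluster(f) for f in target_features]  # three 0-1 sequences
transition = np.sum(label_seqs, axis=0)                    # 0-3 transition label sequence
cluster_labels = (transition >= 1).astype(int)             # binarize: 0 stays 0, 1-3 become 1
```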
Step 1024, obtaining at least one audio segment based on the succession of clustered tag sequences.
Consecutive labels of the same class represent one audio segment. For example, when the cluster label sequence is 000000011100, "0000000" represents the first audio segment, "111" the second, and "00" the third.
However, due to peculiarities of the rhythm of the musical piece itself and deviations during clustering, label errors such as 000000100000 arise easily. Here the "1" in the sequence is obviously anomalous and needs to be removed so that the whole audio segment remains continuous.
In one or more embodiments, the cluster label sequence is smoothed to obtain a binary class curve describing the audio signal, and each stretch of the audio signal corresponding to a continuous run of one class label in the binary class curve is taken as one audio segment.
Smoothing the cluster label sequence to obtain a binary class curve describing the audio signal includes: counting the length of each run of identical labels in the cluster label sequence, determining from these run lengths the sliding-window length and the anomaly thresholds used to eliminate outliers, sliding the window over the cluster label sequence, and correcting anomalous label values according to the anomaly thresholds to obtain the binary class curve. The anomaly thresholds include a first anomaly threshold on the ratio of run lengths and a second anomaly threshold on the ratio of the label sum to the window length.
For example, for the sequence 000000011110000111111, the two runs of label class "0" have lengths 7 and 4 and the two runs of label class "1" have lengths 4 and 6, giving length ratios of 1.75 and 1.5 respectively. Such ratio calculations are performed over the whole audio signal. Preferably, the first anomaly threshold is 3: when the length ratio of any two runs is greater than or equal to 3, an anomaly is deemed to exist in those runs. The sliding window is then slid over the cluster label sequence and the sum of the label values within the window is compared with the window length. If the sum is less than 1/10 of the window length, i.e. fewer than one tenth of the frames in the window carry label 1, the labels in the window are all set to 0; if the sum is greater than 9/10 of the window length, i.e. more than nine tenths of the frames carry label 1, the labels in the window are all set to 1; otherwise the labels in the window are left unchanged. This yields the binary class curve. The second anomaly threshold thus comprises a lower limit of 1/10 and an upper limit of 9/10.
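A sketch of the window-based correction with the 1/10 and 9/10 limits, followed by the run splitting. The window length `win` is assumed to have been derived from the run-length statistics described above:

```python
import numpy as np

def smooth_cluster_labels(labels, win, low=1/10, high=9/10):
    # Slide the window over the 0-1 sequence; force a window to all 0s
    # (all 1s) when its fraction of 1-labels falls below `low` (above `high`).
    out = np.asarray(labels).copy()
    for start in range(len(out) - win + 1):
        frac = out[start:start + win].mean()
        if frac < low:
            out[start:start + win] = 0
        elif frac > high:
            out[start:start + win] = 1
    return out

def split_into_segments(labels):
    # Runs of identical labels in the binary class curve become audio segments.
    bounds = np.flatnonzero(np.diff(labels)) + 1
    return np.split(np.asarray(labels), bounds)
```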
Therefore, by combining multiple binary-clustering label sequences and smoothing the result, the application provides a degree of fault tolerance and improves the generalization ability of audio segmentation detection.
Further, as shown in fig. 6 and 7, step 103, obtaining the rhythm information of each audio segment and determining the rhythm type of the audio segment based on the rhythm information, includes:
Step 1031, obtaining the number of heavy-rhythm points within each audio segment.
A heavy-rhythm point is a peak of the audio signal, i.e. the highest-energy point within its local region.
Step 1032, determining the rhythm density corresponding to each audio segment according to the time length of each audio segment and the number of heavy-rhythm points.
The rhythm density may be the ratio of the number of heavy-rhythm points to the time length of the audio segment.
Step 1033, determining the rhythm type of the audio segment according to the rhythm density corresponding to the audio segment.
Optionally, a rhythm density threshold may be preset: if the rhythm density corresponding to the audio segment is less than the threshold, the segment is determined to be relaxed; if the rhythm density is greater than or equal to the threshold, the segment is determined to be compact.
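A sketch of the density computation and type decision. Heavy-rhythm points are approximated here as local maxima of an onset-strength curve via scipy; the exact peak criterion and the threshold value are assumptions, not fixed by the application:

```python
from scipy.signal import find_peaks

def rhythm_density(onset_strength, duration_seconds):
    # Heavy-rhythm points ~ local energy peaks; density = peak count / segment length.
    peaks, _ = find_peaks(onset_strength)
    return len(peaks) / duration_seconds

def rhythm_type(density, density_threshold):
    # Below the preset threshold the segment is relaxed; otherwise compact.
    return "relaxed" if density < density_threshold else "compact"
```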
Because heavy-rhythm points are rhythm information rooted in the audio characteristics themselves, using them to determine the rhythm density better expresses the actual rhythm of the audio and improves the accuracy of rhythm-type judgment.
When performing inter-segment detection, the method further comprises: for each audio segment, comparing its rhythm density with that of the previous audio segment, and determining the rhythm change type of the segment according to the comparison result.
That is, as shown in fig. 8, after the audio segments are divided, the application further compares the rhythm densities of each pair of preceding and following segments to determine how the rhythm of the following segment changes relative to the preceding one, for example whether it slows down or speeds up, represented by the tags 'slower' and 'faster'.
When performing intra-segment detection, the method further comprises: respectively acquiring the rhythm densities of two adjacent audio sub-segments, calculating the absolute value of the difference between the two densities, and determining the rhythm change trend of the two sub-segments according to the absolute value and a preset trend threshold.
That is, as shown in fig. 9, during intra-segment detection the audio signal within a segment is divided again into audio sub-segments, and the absolute value of the difference between the rhythm densities of two adjacent sub-segments, i.e. the magnitude of the density change, is calculated. If the absolute value is smaller than the preset threshold, the rhythm of the two sub-segments is determined to be unchanged and marked as 'norm'; if it is greater than or equal to the threshold, the densities of the preceding and following sub-segments are further compared to determine how the rhythm of the following sub-segment changes relative to the preceding one, for example whether it becomes slower or faster.
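Both comparisons reduce to a small decision rule; a sketch follows, where the 'norm'/'slower'/'faster' tags follow the labels shown in figs. 8 and 9:

```python
def tempo_trend(prev_density, next_density, trend_threshold):
    # Intra-segment rule: no change if the densities differ by less than the
    # preset trend threshold; otherwise compare the two densities directly.
    if abs(next_density - prev_density) < trend_threshold:
        return "norm"
    return "slower" if next_density < prev_density else "faster"
```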
Therefore, by setting up this two-stage rhythm detection mechanism, the application can detect not only the rhythm transitions between major audio paragraphs but also, at a finer granularity, the rhythm transitions within each segment. The two-stage detection results can provide richer and finer-grained rhythm transition information for audio editing, especially for the audio in video editing, and supply more reference information for intelligent music matching in video clips.
A complete embodiment is described below in conjunction with fig. 10 to 12.
In step 201, a complete audio signal to be detected is obtained.
Step 202, performing first feature extraction on the audio signal based on the audio characteristics to obtain audio feature data.
Wherein the audio characteristics include the STFT characteristic, the mel characteristic and the zero crossing rate. As shown in fig. 11, the zero crossing information, the STFT spectrogram and the mel spectrogram may first be extracted based on these characteristics; the STFT spectrogram is then further split to obtain the STFT phase spectrum, the STFT amplitude spectrum, the STFT energy spectrum and the chroma spectrum; and the STFT amplitude spectrum, the STFT energy spectrum and the mel spectrogram are filtered with a high-frequency filter, finally yielding the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum.
And 203, randomly extracting three groups of features from the audio feature data and performing intra-group splicing, wherein two groups of features comprise two types of audio feature data, and one group of features comprises one type of audio feature data.
For example, as shown in fig. 11, the STFT high-frequency energy spectrum and the zero crossing information are extracted as the first group of audio feature data and stitched within the group, the STFT high-frequency amplitude spectrum and the chroma spectrum are extracted as the second group and stitched within the group, and the mel high-frequency spectrum is extracted as the third group of audio feature information.
And 204, clustering the three groups of spliced characteristic data respectively to obtain a characteristic tag sequence corresponding to each group of characteristic data.
And 205, adding tag values of the three groups of characteristic tag sequences according to the corresponding positions to obtain a transition tag sequence.
And 206, performing binarization processing on the transition tag sequence to obtain a clustering tag sequence.
Step 207, counting the length of each continuous tag sequence in the cluster tag.
Step 208, determining the sliding window length and the anomaly threshold value for eliminating the anomaly value according to the length of the continuous tag sequence.
The abnormal threshold comprises a first abnormal threshold used for judging the abnormal length ratio and a second abnormal threshold used for determining the abnormal label value in the sliding window, wherein the second abnormal threshold comprises an abnormal lower limit value and an abnormal upper limit value.
And 209, sliding on the clustering label sequence by utilizing a sliding window, and carrying out label correction according to the accumulated value of the labels in the sliding window.
If the accumulated value of the cluster labels is smaller than the lower limit value of the second abnormal threshold value, namely 1/10 of the sliding window length, setting all labels in the current window to be 0, and if the accumulated value of the cluster labels is larger than the upper limit value of the second abnormal threshold value, namely 9/10 of the sliding window length, setting all labels in the current window to be 1.
In step 210, the audio signals corresponding to the tag values that are the same in succession are used as one audio segment.
Step 211, the number of heavy-rhythm points in each audio segment is obtained.
Step 212, the rhythm density corresponding to each audio segment is determined according to the time length of each audio segment and the number of heavy-rhythm points.
Step 213, determining the rhythm type of the audio segment according to the rhythm density corresponding to the audio segment.
If the rhythm density is smaller than a preset threshold, the audio segment is determined to be relaxed; if the rhythm density is greater than or equal to the preset threshold, the audio segment is determined to be compact.
Step 214, for any audio segment, performing a second feature extraction on the audio signal based on the audio characteristics to obtain audio feature data.
Step 215, randomly extracting two groups of features from the audio feature data and performing intra-group stitching, wherein each group of features comprises two types of audio feature data.
For example, as shown in fig. 12, the mel high-frequency spectrum and the zero crossing information are extracted as the first group of audio feature data and stitched within the group, and the STFT phase spectrum and the chroma spectrum are extracted as the second group and stitched within the group.
And step 216, clustering the two groups of spliced characteristic data respectively to obtain a characteristic tag sequence corresponding to each group of characteristic data.
And step 217, adding the tag values of the two groups of characteristic tag sequences according to the corresponding positions to obtain a transition tag sequence corresponding to the audio segment.
And 218, performing binarization processing on the transition tag sequence corresponding to the audio segment to obtain a clustering tag sequence corresponding to the audio segment.
Step 219, the length of each continuous tag sequence in the cluster tags corresponding to the audio segments is counted.
Step 220, determining a sliding window length and an anomaly threshold for rejecting anomalies according to the lengths of the continuous tag sequences.
The abnormal threshold comprises a first abnormal threshold used for judging the abnormal length ratio and a second abnormal threshold used for determining the abnormal label value in the sliding window, wherein the second abnormal threshold comprises an abnormal lower limit value and an abnormal upper limit value.
Step 221, sliding on the clustered label sequence by utilizing a sliding window, and correcting the label according to the accumulated value of the label in the sliding window.
If the accumulated value of the cluster labels is smaller than the lower limit value of the second abnormal threshold value, namely 1/10 of the sliding window length, setting all labels in the current window to be 0, and if the accumulated value of the cluster labels is larger than the upper limit value of the second abnormal threshold value, namely 9/10 of the sliding window length, setting all labels in the current window to be 1.
Step 222, using the audio signals corresponding to the consecutive same tag values as an audio sub-segment.
Step 223, the number of heavy-rhythm points within each audio sub-segment is obtained.
Step 224, the rhythm density corresponding to each audio sub-segment is determined according to the time length of each audio sub-segment and the number of heavy-rhythm points.
Step 225, obtaining the rhythm densities of two adjacent audio sub-segments, and calculating the absolute value of the difference between the two rhythm densities.
And step 226, determining the rhythm variation trend of the two audio subsections according to the absolute value of the difference and a preset trend threshold.
If the absolute value of the difference is smaller than the preset trend threshold, it is determined that there is no rhythm change between the two audio sub-segments. If it is greater than or equal to the threshold, a rhythm change exists, and the rhythm densities of the following and preceding sub-segments are further compared: if the density of the following sub-segment is smaller than that of the preceding one, the rhythm becomes slower; if it is larger, the rhythm becomes faster.
Through the above scheme, detecting a certain piece of audio yields the following label sequences:
Inter-segment rhythm transitions: ['slower', 'faster', 'slower', 'faster', 'slower', 'faster', 'slower']
Segment time intervals: [[0.0, 29.174648526077096], [29.174648526077096, 76.87804988662131], [76.87804988662131, 89.8109977324263], [89.8109977324263, 98.46009070294785], [98.46009070294785, 104.17197278911564], [104.17197278911564, 151.87537414965988], [151.87537414965988, 163.12496598639456]]
Segment rhythm densities: [1.645264036586295, 5.9953796132565165, 1.0825064917076215, 3.0060955626925314, 1.5756627787879056, 5.827676687011578, 0.266676341998803]
Intra-segment rhythm transitions: [['fast', 'slow', 'fast'], ['slow', 'fast', 'slow'], ['norm'], ['norm'], ['norm'], ['slow', 'fast', 'slow'], ['fast', 'slow']]
Audio sub-segment time intervals: [[[0.0, 2.6585714285714284], [2.6585714285714284, 14.448004535147392], [14.448004535147392, 29.168843537414965]], [[29.174648526077096, 69.36668934240363], [69.36668934240363, 76.48331065759638], [76.48331065759638, 76.87224489795918]], [[76.87804988662131, 89.80519274376417]], [[89.8109977324263, 98.45428571428572]], [[98.46009070294785, 104.16616780045351]], [[104.17197278911564, 144.3872335600907], [144.3872335600907, 151.41097505668935], [151.41097505668935, 151.86956916099774]], [[151.87537414965988, 153.45426303854876], [153.45426303854876, 163.1191836734694]]]
Audio sub-segment rhythm densities: [[0.7522837184309512, 1.526790969275688, 1.9020655001856164], [5.722526035716279, 7.728386486236747, 2.5711287313433573], [1.0829925958669468], [3.0081145108862453], [1.5772657547747198], [5.495426257673417, 7.830584315586568, 4.361155063291147], [0.6333567909922612, 0.20693392895268442]]
In summary, after the audio signal to be detected is obtained, the application performs feature processing on the audio signal based on its audio characteristics to obtain at least one audio segment, and then determines the rhythm type of each audio segment according to the rhythm information of each segment. Automatic, efficient and accurate music rhythm detection is thereby realized: rhythms no longer need to be labeled manually, the cost of music rhythm detection is greatly reduced, and both the efficiency and the accuracy of music editing are improved.
It should be noted that although the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order or that all of the illustrated operations be performed in order to achieve desirable results.
With further reference to fig. 13, an exemplary block diagram of an audio tempo detection apparatus according to one embodiment of the application is shown.
As shown in the figure, the audio tempo detection apparatus 10 includes:
an acquisition module 11, configured to acquire an audio signal to be detected;
a segmentation module 12, configured to perform feature processing on the audio signal based on the audio characteristics of the audio signal, so as to obtain at least one audio segment, where two adjacent audio segments have different rhythms;
And the comparison module 13 is used for acquiring the rhythm information of each audio segment and determining the rhythm type of the audio segment based on the rhythm information.
In some embodiments, the segmentation module 12 is further configured to: performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data;
Performing feature stitching processing on the audio feature data to obtain target features;
Classifying the target features based on a clustering algorithm to obtain a clustering tag sequence corresponding to the audio signal;
and obtaining at least one audio segment based on the continuity condition of the clustering label sequence.
In some embodiments, the audio characteristics include STFT characteristics, mel characteristics, and zero crossing rates, and the segmentation module 12 is further configured to:
performing feature extraction on the audio signal based on the STFT characteristic, the mel characteristic and the zero crossing rate respectively to obtain STFT frequency spectrum, mel frequency spectrum and zero crossing information corresponding to the audio signal;
performing feature decomposition on the STFT spectrum to obtain an STFT energy spectrum, an STFT amplitude spectrum, an STFT phase spectrum and a tone spectrum;
and respectively carrying out frequency screening on the STFT energy spectrum, the STFT amplitude spectrum and the mel frequency spectrum to obtain an STFT high-frequency energy spectrum, an STFT high-frequency amplitude spectrum and a mel high-frequency spectrum.
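As one concrete way to obtain these features, the sketch below uses librosa; the high-frequency cutoff (the upper half of the bins) and the use of a chroma spectrum as a stand-in for the tone spectrum are assumptions of this sketch, not requirements of the patent:

```python
# Illustrative feature extraction; cutoffs and file name are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=None)    # hypothetical input file

stft = librosa.stft(y)                       # complex STFT spectrum
stft_mag = np.abs(stft)                      # STFT amplitude spectrum
stft_energy = stft_mag ** 2                  # STFT energy spectrum
stft_phase = np.angle(stft)                  # STFT phase spectrum
tone = librosa.feature.chroma_stft(S=stft_energy, sr=sr)  # stand-in tone spectrum
mel = librosa.feature.melspectrogram(y=y, sr=sr)          # mel frequency spectrum
zcr = librosa.feature.zero_crossing_rate(y)               # zero-crossing information

cut = stft_mag.shape[0] // 2                 # assumed high-frequency cutoff
stft_hf_energy = stft_energy[cut:, :]        # STFT high-frequency energy spectrum
stft_hf_mag = stft_mag[cut:, :]              # STFT high-frequency amplitude spectrum
mel_hf = mel[mel.shape[0] // 2:, :]          # mel high-frequency spectrum
```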
In some embodiments, the audio tempo detection includes inter-segment detection and intra-segment detection, and the segmentation module 12 is further configured to, when performing inter-segment detection: select three groups of feature data from the zero-crossing information, the tone spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, stitch each group separately, and take all the features of the three stitched groups as the target features, where at least two of the groups each contain at least two types of feature data.
In some embodiments, the audio tempo detection includes inter-segment detection and intra-segment detection, and the segmentation module 12 is further configured to, when performing intra-segment detection: select two groups of feature data from the zero-crossing information, the tone spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, stitch each group separately, and take the features of the two stitched groups as the target features, where each group contains at least two types of feature data.
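Continuing the extraction sketch above, a possible intra-segment grouping stitches two groups along the feature axis; the particular pairings below are illustrative, since the patent leaves the group composition open:

```python
# Illustrative feature stitching: two groups, each with >= 2 feature types.
# Frame counts are trimmed to a common length before stacking.
import numpy as np

n = min(f.shape[1] for f in (stft_hf_energy, stft_hf_mag, mel_hf, zcr, tone))
group_1 = np.vstack([stft_hf_energy[:, :n], mel_hf[:, :n]])  # assumed pairing
group_2 = np.vstack([stft_hf_mag[:, :n], zcr[:, :n]])        # assumed pairing
target_features = [group_1, group_2]
```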
In some embodiments, the segmentation module 12 is further configured to: classifying each target feature by using a two-cluster algorithm to obtain a feature tag sequence corresponding to each target feature;
Combining the obtained feature tag sequences to obtain a transition tag sequence;
and performing binarization processing on the transition tag sequence to obtain the clustering tag sequence.
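A two-class k-means clustering is one concrete reading of the "two-cluster" algorithm; the majority-vote binarization below is likewise an assumption of this sketch:

```python
# Illustrative per-feature binary clustering and tag fusion.
import numpy as np
from sklearn.cluster import KMeans

feature_tag_seqs = []
for feat in target_features:                 # feat shape: (n_dims, n_frames)
    tags = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feat.T)
    feature_tag_seqs.append(tags)

transition = np.sum(feature_tag_seqs, axis=0)  # merged transition tag sequence
cluster_tags = (transition >= len(feature_tag_seqs) / 2).astype(int)  # binarized
```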
In some embodiments, the segmentation module 12 is further configured to: smoothing the clustering label sequence to obtain a binary class curve for describing the audio signal;
and taking the audio signal continuously corresponding to any one type label in the binary class curve as one audio segment.
In some embodiments, the segmentation module 12 is further configured to: counting the length of each continuous tag sequence in the clustered tag sequence;
Determining a sliding window length and an abnormal threshold value for eliminating abnormal values according to the length of the continuous tag sequence;
And sliding on the clustering label sequence by utilizing a sliding window, and correcting the abnormal label value on the clustering label sequence according to the abnormal threshold value to obtain the binary class curve.
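One way to realize this sliding-window correction is a median filter whose window length is derived from the run lengths of the tag sequence; the derivation rule below (median run length, forced odd) is an assumption of this sketch:

```python
# Illustrative smoothing of the cluster tag sequence into a binary curve,
# then cutting the curve into audio segments.
import numpy as np
from scipy.signal import medfilt

def run_lengths(tags):
    """Length of every maximal run of identical tags."""
    runs, count = [], 1
    for prev, cur in zip(tags, tags[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

win = max(3, int(np.median(run_lengths(cluster_tags))) | 1)  # odd window length
binary_curve = medfilt(cluster_tags.astype(float), kernel_size=win).astype(int)

def segments_from_curve(curve, hop_s):
    """Each maximal run of one tag value becomes one segment (start, end) in seconds."""
    cuts = [0] + [i for i in range(1, len(curve)) if curve[i] != curve[i - 1]] + [len(curve)]
    return [(b * hop_s, e * hop_s) for b, e in zip(cuts, cuts[1:])]
```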
In some embodiments, the comparison module 13 is further configured to: acquiring the number of heavy rhythm points in each audio segment;
determining the rhythm density corresponding to each audio segment according to the time length of each audio segment and the number of heavy rhythm points;
And determining the rhythm type of the audio segment according to the rhythm density corresponding to the audio segment.
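The density-to-type mapping can be sketched as below, using librosa's beat tracker as a stand-in detector of heavy rhythm points; the thresholds are assumed, not taken from the patent:

```python
# Illustrative rhythm-type decision for one audio segment.
import librosa

def segment_rhythm_type(y, sr, start, end, slow_thr=1.5, fast_thr=4.0):
    seg = y[int(start * sr):int(end * sr)]
    _, beat_frames = librosa.beat.beat_track(y=seg, sr=sr)  # stand-in beat points
    density = len(beat_frames) / (end - start)              # points per second
    if density < slow_thr:
        return "slow", density
    if density > fast_thr:
        return "fast", density
    return "norm", density
```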
In some embodiments, the audio tempo detection includes inter-segment detection and intra-segment detection, and the comparison module 13 is further configured to, when performing inter-segment detection: comparing, for each of the audio segments, the tempo density corresponding to the audio segment with the tempo density corresponding to the preceding audio segment;
And determining the rhythm change type of the audio segment according to the comparison result.
In some embodiments, the audio tempo detection includes inter-segment detection and intra-segment detection, and the comparison module 13 is further configured to, when performing intra-segment detection: respectively obtaining rhythm densities of two adjacent audio segments, and calculating absolute values of differences of the two rhythm densities;
and determining the rhythm change trend of the two audio segments according to the absolute value of the difference and a preset trend threshold.
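For inter-segment detection, each segment can be tagged relative to its predecessor, yielding a list like the 'slower'/'faster' sequence shown earlier; the rule for the first segment (comparison against the mean density) is an assumption of this sketch:

```python
# Illustrative inter-segment tagging from per-segment rhythm densities.
import numpy as np

def inter_segment_tags(densities):
    first = "slower" if densities[0] < np.mean(densities) else "faster"  # assumed rule
    tags = [first]
    for prev, cur in zip(densities, densities[1:]):
        tags.append("faster" if cur > prev else "slower")
    return tags
```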
It should be understood that the units or modules described in the audio tempo detection device 10 correspond to the individual steps in the method described with reference to fig. 2. Thus, the operations and features described above for the method are equally applicable to the audio tempo detection apparatus 10 and the units contained therein, and are not described here again. The audio tempo detection apparatus 10 may be implemented in advance in a browser or other security application of the electronic device, or may be loaded into a browser or security application of the electronic device by means of downloading or the like. The corresponding units in the audio tempo detection apparatus 10 may cooperate with units in the electronic device to implement aspects of embodiments of the application.
The division of the modules or units mentioned in the above detailed description is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
In summary, after the audio signal to be detected is obtained, the present application performs feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, and then determines the rhythm type of each audio segment according to the rhythm information of each audio segment. Automatic, efficient and accurate music rhythm detection is thus realized: rhythms no longer need to be marked manually, the cost of music rhythm detection is greatly reduced, and both the efficiency and the accuracy of music editing are improved.
Referring now to fig. 14, fig. 14 shows a schematic diagram of a computer system suitable for use in implementing an electronic device or server of an embodiment of the application.
As shown in fig. 14, the computer system 1400 includes a Central Processing Unit (CPU) 1401, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1402 or a program loaded from a storage section 1408 into a Random Access Memory (RAM) 1403. The RAM 1403 also stores the various programs and data required for operation of the system. The CPU 1401, the ROM 1402 and the RAM 1403 are connected to each other through a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
The following components are connected to the I/O interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output section 1407 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1408 including a hard disk or the like; and a communication section 1409 including a network interface card such as a LAN card or a modem. The communication section 1409 performs communication processing via a network such as the internet. A drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1410 as needed, so that a computer program read therefrom is installed into the storage section 1408 as needed.
In particular, the process described above with reference to the flowchart of fig. 2 may be implemented as a computer software program according to an embodiment of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 1409 and/or installed from the removable medium 1411. The above-described functions defined in the system of the present application are performed when the computer program is executed by the Central Processing Unit (CPU) 1401.
The computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
The units or modules involved in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be provided in a processor, for example described as: a processor includes an acquisition module, a segmentation module, and a determination module. The names of these units or modules do not in any way limit the units or modules themselves; for example, the acquisition module may also be described as "a module for acquiring the audio signal to be detected".
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments or may exist alone without being incorporated in the electronic device. The computer-readable storage medium stores one or more programs that, when executed by one or more processors, perform the audio tempo detection method described in the present application.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present application is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the spirit of the disclosure, for example technical solutions in which the above features are replaced with technical features having similar functions disclosed in (but not limited to) the present application.
Claims (22)
1. An audio tempo detection method, said method comprising:
Acquiring an audio signal to be detected;
Performing feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, wherein the rhythms of two adjacent audio segments are different; the processing the audio signal to obtain at least one audio segment based on the audio characteristic of the audio signal includes:
performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data;
Performing feature stitching processing on the audio feature data to obtain target features;
Classifying the target features based on a clustering algorithm to obtain a clustering tag sequence corresponding to the audio signal;
Obtaining at least one audio segment based on the continuity of the cluster tag sequence;
And acquiring rhythm information of each audio segment, and determining the rhythm type of each audio segment based on the rhythm information.
2. The method of claim 1, wherein the audio characteristics include an STFT characteristic, a mel characteristic, and a zero crossing rate, wherein the feature extracting the audio signal based on the audio characteristics to obtain audio feature data comprises:
performing feature extraction on the audio signal based on the STFT characteristic, the mel characteristic and the zero crossing rate respectively to obtain STFT frequency spectrum, mel frequency spectrum and zero crossing information corresponding to the audio signal;
performing feature decomposition on the STFT spectrum to obtain an STFT energy spectrum, an STFT amplitude spectrum, an STFT phase spectrum and a tone spectrum;
And respectively carrying out frequency screening on the STFT energy spectrum, the STFT amplitude spectrum and the mel frequency spectrum to obtain an STFT high-frequency energy spectrum, an STFT high-frequency amplitude spectrum and a mel high-frequency spectrum.
3. The method according to claim 2, wherein the audio tempo detection includes inter-segment detection and intra-segment detection, and the performing feature stitching processing on the audio feature data to obtain target features includes:
And selecting three groups of characteristic data from the zero crossing information, the tone spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum to respectively splice in groups, and taking all the characteristics of the three groups of spliced characteristic data as the target characteristic, wherein at least two groups of characteristic data respectively comprise at least two types of characteristic data.
4. The method according to claim 2, wherein the audio tempo detection includes inter-segment detection and intra-segment detection, and the performing feature stitching processing on the audio feature data to obtain target features includes:
and selecting two groups of characteristic data from the zero crossing information, the tone spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum to respectively splice in groups, and taking the characteristics of the two groups of spliced characteristic data as the target characteristic, wherein each group of characteristic data comprises at least two types of characteristic data.
5. The method according to claim 1, wherein classifying the target feature based on a clustering algorithm to obtain a cluster tag sequence corresponding to the audio signal comprises:
Classifying each target feature by using a two-cluster algorithm to obtain a feature tag sequence corresponding to each target feature;
Combining the obtained characteristic tag sequences to obtain a transition tag sequence;
and performing binarization processing on the transition tag sequence to obtain the clustering tag sequence.
6. The method of claim 1, wherein the deriving at least one audio segment based on the succession of clustered tag sequences comprises:
smoothing the clustering label sequence to obtain a binary class curve for describing the audio signal;
and taking the audio signal continuously corresponding to any one type label in the binary class curve as one audio segment.
7. The method of claim 6, wherein smoothing the clustered tag sequence to obtain a binary class curve describing the audio signal comprises:
Counting the length of each continuous tag sequence in the clustered tag sequences;
Determining a sliding window length and an abnormal threshold value for eliminating abnormal values according to the length of the continuous tag sequence;
And sliding on the clustering label sequence by utilizing a sliding window, and correcting the abnormal label value on the clustering label sequence according to the abnormal threshold value to obtain the binary class curve.
8. The method of claim 1, wherein the obtaining the tempo information for each of the audio segments and determining the tempo type for the audio segment based on the tempo information comprises:
acquiring the number of heavy rhythm points in each audio segment;
determining the rhythm density corresponding to each audio segment according to the time length of each audio segment and the number of the heavy rhythm points;
And determining the rhythm type of the audio segment according to the rhythm density corresponding to the audio segment.
9. The method of claim 8, wherein the audio tempo detection comprises inter-segment detection and intra-segment detection, and wherein the method further comprises, in performing inter-segment detection:
comparing, for each of the audio segments, the tempo density corresponding to the audio segment with the tempo density corresponding to the preceding audio segment;
And determining the rhythm change type of the audio segment according to the comparison result.
10. The method of claim 8, wherein the audio tempo detection comprises inter-segment detection and intra-segment detection, and wherein the method further comprises, upon intra-segment detection:
Respectively obtaining rhythm densities of two adjacent audio segments, and calculating absolute values of differences of the two rhythm densities;
and determining the rhythm change trend of the two audio segments according to the absolute value of the difference and a preset trend threshold.
11. An audio tempo detection device, said device comprising:
The acquisition module is used for acquiring the audio signal to be detected;
The segmentation module is used for carrying out feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, and the rhythms of two adjacent audio segments are different;
The segmentation module is further configured to: performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data;
Performing feature stitching processing on the audio feature data to obtain target features;
Classifying the target features based on a clustering algorithm to obtain a clustering tag sequence corresponding to the audio signal;
Obtaining at least one audio segment based on the continuity of the cluster tag sequence;
And the determining module is used for acquiring the rhythm information of each audio segment and determining the rhythm type of the audio segment based on the rhythm information.
12. The apparatus of claim 11, wherein the audio characteristics include an STFT characteristic, a mel characteristic, and a zero crossing rate, the segmentation module further configured to:
performing feature extraction on the audio signal based on the STFT characteristic, the mel characteristic and the zero crossing rate respectively to obtain STFT frequency spectrum, mel frequency spectrum and zero crossing information corresponding to the audio signal;
performing feature decomposition on the STFT spectrum to obtain an STFT energy spectrum, an STFT amplitude spectrum, an STFT phase spectrum and a tone spectrum;
And respectively carrying out frequency screening on the STFT energy spectrum, the STFT amplitude spectrum and the mel frequency spectrum to obtain an STFT high-frequency energy spectrum, an STFT high-frequency amplitude spectrum and a mel high-frequency spectrum.
13. The apparatus of claim 12, wherein the audio tempo detection comprises inter-segment detection and intra-segment detection, and wherein the segmentation module is further configured to, when performing inter-segment detection: and selecting three groups of characteristic data from the zero crossing information, the tone spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum to respectively splice in groups, and taking all the characteristics of the three groups of spliced characteristic data as the target characteristic, wherein at least two groups of characteristic data respectively comprise at least two types of characteristic data.
14. The apparatus of claim 12, wherein the audio tempo detection comprises inter-segment detection and intra-segment detection, and wherein the segmentation module is further configured to, when performing intra-segment detection: and selecting two groups of characteristic data from the zero crossing information, the tone spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum to respectively splice in groups, and taking the characteristics of the two groups of spliced characteristic data as the target characteristic, wherein each group of characteristic data comprises at least two types of characteristic data.
15. The apparatus of claim 11, wherein the segmentation module is further configured to: classifying each target feature by using a two-cluster algorithm to obtain a feature tag sequence corresponding to each target feature;
Combining the obtained characteristic tag sequences to obtain a transition tag sequence;
and performing binarization processing on the transition tag sequence to obtain the clustering tag sequence.
16. The apparatus of claim 11, wherein the segmentation module is further configured to: smoothing the clustering label sequence to obtain a binary class curve for describing the audio signal;
and taking the audio signal continuously corresponding to any one type label in the binary class curve as one audio segment.
17. The apparatus of claim 16, wherein the segmentation module is further configured to: counting the length of each continuous tag sequence in the clustered tag sequences;
Determining a sliding window length and an abnormal threshold value for eliminating abnormal values according to the length of the continuous tag sequence;
And sliding on the clustering label sequence by utilizing a sliding window, and correcting the abnormal label value on the clustering label sequence according to the abnormal threshold value to obtain the binary class curve.
18. The apparatus of claim 11, wherein the determining module is further configured to: acquiring the number of heavy rhythm points in each audio segment;
determining the rhythm density corresponding to each audio segment according to the time length of each audio segment and the number of the heavy rhythm points;
And determining the rhythm type of the audio segment according to the rhythm density corresponding to the audio segment.
19. The apparatus of claim 18, wherein the audio tempo detection comprises inter-segment detection and intra-segment detection, and wherein the determining module is further configured to, when performing inter-segment detection: comparing, for each of the audio segments, the tempo density corresponding to the audio segment with the tempo density corresponding to the preceding audio segment;
And determining the rhythm change type of the audio segment according to the comparison result.
20. The apparatus of claim 18, wherein the audio tempo detection comprises inter-segment detection and intra-segment detection, and wherein the determining module is further configured to, when performing intra-segment detection: respectively obtaining rhythm densities of two adjacent audio segments, and calculating absolute values of differences of the two rhythm densities;
and determining the rhythm change trend of the two audio segments according to the absolute value of the difference and a preset trend threshold.
21. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the audio tempo detection method according to any one of claims 1-10 when executing the program.
22. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements an audio tempo detection method according to any one of claims 1-10.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110622026.XA CN113823325B (en) | 2021-06-03 | 2021-06-03 | Audio rhythm detection method, device, equipment and medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110622026.XA CN113823325B (en) | 2021-06-03 | 2021-06-03 | Audio rhythm detection method, device, equipment and medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113823325A CN113823325A (en) | 2021-12-21 |
| CN113823325B true CN113823325B (en) | 2024-08-16 |
Family
ID=78923804
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110622026.XA Active CN113823325B (en) | 2021-06-03 | 2021-06-03 | Audio rhythm detection method, device, equipment and medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113823325B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115101094A (en) * | 2022-06-20 | 2022-09-23 | 北京达佳互联信息技术有限公司 | Audio processing method and device, electronic device, storage medium |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104050974A (en) * | 2013-03-14 | 2014-09-17 | 雅马哈株式会社 | Sound signal analysis apparatus, sound signal analysis method and sound signal analysis program |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2001175253A (en) * | 1999-12-16 | 2001-06-29 | Nippon Columbia Co Ltd | Operating device for video editing device, and video editing device |
| JP5569228B2 (en) * | 2010-08-02 | 2014-08-13 | ソニー株式会社 | Tempo detection device, tempo detection method and program |
| CN105513583B (en) * | 2015-11-25 | 2019-12-17 | 福建星网视易信息系统有限公司 | song rhythm display method and system |
| CN108882015B (en) * | 2018-06-27 | 2021-07-23 | Oppo广东移动通信有限公司 | Method, device, electronic device and storage medium for adjusting playback speed of recall video |
| US11282534B2 (en) * | 2018-08-03 | 2022-03-22 | Sling Media Pvt Ltd | Systems and methods for intelligent playback |
| CN111785237B (en) * | 2020-06-09 | 2024-04-19 | Oppo广东移动通信有限公司 | Audio rhythm determination method, device, storage medium and electronic device |
- 2021-06-03: CN application CN202110622026.XA granted as patent CN113823325B (en), legal status Active
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104050974A (en) * | 2013-03-14 | 2014-09-17 | 雅马哈株式会社 | Sound signal analysis apparatus, sound signal analysis method and sound signal analysis program |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113823325A (en) | 2021-12-21 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |