Improved recording equipment identification algorithm
Technical Field
The invention relates to the technical field of recording equipment, in particular to an improved recording equipment identification algorithm.
Background
Sound is the most natural means of human communication. As audio technology has matured, audio has spread into many aspects of social life. Different recording equipment manufacturers generally record audio using different digital signal processing methods and circuits, and these differences leave traces in the audio signal that distinguish it from audio captured by other recording equipment. The recording apparatus can therefore be identified, to some extent, by analyzing the audio signal. In judicial cases, parties often claim that a piece of evidence was recorded with a particular device, so determining which device actually recorded a given audio file is an urgent problem for the relevant judicial departments.
With the development of machine learning and deep learning techniques, researchers have proposed a variety of effective machine learning and deep learning recognition models. In 2007, Christian Kraetzer et al. identified microphone equipment using mixed time-domain and frequency-domain features, verified experimentally with a naive Bayes classifier among others, finally obtaining a recognition rate of 75.99%. In 2009, Robert Buchholz used naive Bayes, logistic regression, and a support vector machine as classifiers to classify microphones, with the Fourier coefficients of the audio as the feature input of the model. In 2011, other researchers verified the effectiveness of the pitch frequency, the formant frequency, and MFCCs in the recording equipment identification process. In 2012, Cemal Hanilçi extracted the Mel-frequency cepstral coefficients (MFCC) of audio as features and used a support vector machine as the model classifier to identify 14 different telephone devices, reaching a recognition rate of 96.42%. In 2014, Vandana Pandey found that the power spectral density function of audio can distinguish microphone devices to some extent. In the same year, Ling Zou et al. demonstrated that sound recording devices can be effectively distinguished using MFCC and power-normalized cepstral coefficients (PNCC).
Judging from the current state of research, relatively few studies target recording device identification specifically. First, the feature databases of recording equipment are insufficient: with the arrival of the 4G era, the brands and models of mobile phones on the market keep increasing, and existing databases are not updated in time. Second, regarding the characteristic parameters of recording equipment: the features used in recording equipment recognition are generally borrowed from speech recognition and are not designed specifically for recording equipment recognition. Finally, regarding the recognition model: existing recording equipment identification models are all models that perform well in speech recognition or speaker recognition, and their parameter settings and model design have not been specially improved for the characteristics of recording equipment.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects of the prior art, the invention provides an improved recording equipment identification algorithm which overcomes the low recognition rate and poor generalization of existing recording equipment identification, and which can effectively identify the mobile phones and computer equipment most widely used on the current market.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme:
an improved sound recording device identification algorithm comprising the steps of:
step S1, framing and preprocessing the audio signal to be detected;
step S2, constructing a first model, wherein the first model comprises a bidirectional gated recurrent unit layer, a unidirectional gated recurrent unit layer and an attention layer arranged in sequence, and multi-dimensional frame-level features of the signal from step S1 are extracted as the input of the first model;
step S3, constructing a second model, wherein the second model comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a skip connection, a fourth convolutional layer and a global average pooling layer arranged in sequence, and Mel spectrum features of the signal from step S1 are extracted as the input of the second model;
step S4, splicing and fusing the output features of the first model and the second model, then classifying to obtain the recognition result.
Preferably, in step S2, 72-dimensional frame-level features are extracted, and after processing by the first model, a 1000-dimensional feature vector is output.
Preferably, in step S3, the output of the first convolutional layer is added to the output of the third convolutional layer to form the final output of the third convolutional layer.
Preferably, in step S1, the audio signal is framed with a frame length of 1024 and a frame shift of 25%, and a Hanning window is applied to each frame before the multi-dimensional frame-level features are extracted.
Preferably, in step S1, the audio signal is framed with a frame length of 1024 and a frame shift of 25%; an FFT is computed for each frame of data, with 2048 FFT points; a logarithmic Mel spectrogram is then calculated through a Mel filter bank with 80 sub-band filters.
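For illustration, this preprocessing might be implemented as in the following minimal sketch, assuming the librosa library; interpreting the 25% frame shift as a hop of one quarter of the frame length (256 samples) is an assumption, since the hop is not stated in samples.

```python
# Minimal preprocessing sketch for step S1 (assumed: librosa, mono audio).
# Frame length 1024, assumed hop of 256 samples (25% frame shift),
# 2048-point FFT, and a Mel filter bank with 80 sub-band filters.
import librosa

def log_mel_spectrogram(path, frame_length=1024, n_fft=2048, n_mels=80):
    y, sr = librosa.load(path, sr=None, mono=True)   # keep native sample rate
    hop_length = frame_length // 4                   # assumed 25% frame shift
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        win_length=frame_length, window="hann", n_mels=n_mels)
    return librosa.power_to_db(mel)                  # logarithmic Mel spectrogram
```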
Preferably, in step S2, the multi-dimensional frame-level features include the short-time zero-crossing rate, root-mean-square energy, fundamental frequency, spectral centroid, spectral spread, spectral entropy, spectral flux, formant frequencies, first-order difference Mel-frequency cepstral coefficients, second-order difference Mel-frequency cepstral coefficients, linear prediction coefficients, and Bark-frequency cepstral coefficients.
Preferably, in step S2, given an input sequence $x = (x_1, x_2, \ldots, x_n)$ and a corresponding query $q$, the output $s$ of the attention layer is expressed as the expectation of the input under the class probability distribution $P(v \mid x, q)$:

$$s = \sum_{i=1}^{n} P(v = i \mid x, q)\, x_i$$
Beneficial effects: the improved recording equipment recognition algorithm of the invention has the following advantages:
1) frame-level features of the signal are introduced into the recording equipment identification algorithm, preserving the time-sequence characteristics of the audio signal;
2) an attention mechanism is added to weight and sum the high-level features according to their importance, finally yielding high-quality feature parameters related to the recording equipment and improving the robustness of the model;
3) the standard convolutional neural network model is improved by adding a skip connection structure, further improving the performance of the model;
4) the final model fusion is realized by concatenating hidden units; the method improves the recognition effect of recording equipment identification and the robustness of the model, and has good application prospects.
Drawings
FIG. 1 is a schematic diagram of the model structure of the improved recording equipment recognition algorithm according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the improved sound recording device recognition algorithm of the present invention comprises the following steps. Step (1): extract 72-dimensional frame-level feature parameters from each audio file as the input of the first model. Since an audio signal is approximately stationary over short intervals and non-stationary over long intervals, the signal is first framed; the frame length in the invention is 1024. To smooth the transition between two adjacent frames, the frames overlap, with a frame shift of 25% of the frame length. Since framing causes spectral leakage, a Hanning window is applied to each frame.
Finally, feature extraction is performed. For each frame, 72-dimensional features are extracted: short-time zero-crossing rate, root-mean-square energy, fundamental frequency, spectral centroid, spectral spread, spectral entropy, spectral flux, formant frequencies, first-order difference Mel cepstral coefficients, second-order difference Mel cepstral coefficients, linear prediction coefficients, and Bark-frequency cepstral coefficients; the specific parameters are shown in Table 1. These features are then combined frame by frame, so that each frame carries 72-dimensional speech features, and the order of the frames preserves the timing information of the original audio signal. The final feature dimension is (number of frames × 72), where the number of frames changes dynamically with the original audio length, resolving the contradiction between fixed-dimension features and variable speech length.
TABLE 1
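For illustration, a minimal sketch of the frame-level feature extraction, assuming librosa; only a subset of the twelve feature families listed above is shown, and the per-family dimensions that make up the 72 dimensions are assumptions, since Table 1 is not reproduced here.

```python
# Sketch of frame-level feature extraction (assumed: librosa).
# Computes a subset of the listed features and stacks them per frame.
import librosa
import numpy as np

def frame_level_features(y, sr, frame_length=1024):
    hop = frame_length // 4                          # assumed 25% frame shift
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length, hop_length=hop)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop)
    cen = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame_length, hop_length=hop)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame_length, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)                 # first-order difference MFCC
    d2 = librosa.feature.delta(mfcc, order=2)        # second-order difference MFCC
    # Result shape (n_frames, n_dims): the frame count varies with the audio
    # length, and the frame order preserves the timing information.
    return np.vstack([zcr, rms, cen, d1, d2]).T
```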
Step (2), constructing the first model: the first model is built from one bidirectional gated recurrent unit layer, one unidirectional gated recurrent unit layer, and one attention layer. A recurrent neural network handles time-sequence signals well, while the attention mechanism can learn the characteristics of a time-sequence signal autonomously; combining the two effectively mines the feature parameters of the time-sequence signal. The input of the model is the 72-dimensional frame-level features.
The attention mechanism simulates the human visual attention mechanism. Suppose the input sequence is $x = (x_1, x_2, \ldots, x_n)$ and the corresponding query is $q$. The standard attention mechanism uses a score function $f(x_i, q)$ to calculate an alignment score $a_i$ between $q$ and $x_i$. All alignment scores of $q$ with respect to $x$ are written as $a = (a_1, a_2, \ldots, a_n)$. Finally, $a$ is mapped to a class probability distribution $P(v \mid x, q)$ using a softmax function, where $P(v = i \mid x, q)$ represents the probability of selecting $x_i$ according to $q$, as in equation (1):

$$P(v = i \mid x, q) = \frac{\exp(f(x_i, q))}{\sum_{j=1}^{n} \exp(f(x_j, q))} \tag{1}$$

Equation (2) expresses the attention output $s$ as the expectation of the input under the class probability distribution $P(v \mid x, q)$:

$$s = \sum_{i=1}^{n} P(v = i \mid x, q)\, x_i \tag{2}$$
the attention mechanism can endow different importance to the local part of the same sample, automatically learn the characteristics of a time sequence signal and improve the robustness of the model. And outputting a 1000-dimensional characteristic vector by the model, overlapping the output of the model II, and finally classifying.
Step (3): extract a Mel spectrum from each audio file as the input of the second model. First, the audio signal is framed with a frame length of 1024 and a frame shift of 25%; second, an FFT is computed for each frame of data, with 2048 FFT points; third, a logarithmic Mel spectrogram is calculated using a Mel filter bank with 80 sub-band filters.
Step (4), constructing the second model: the input of the second model is the Mel spectrum obtained in step (3). The first three layers of the second model are convolutional layers with a skip connection added, followed by one further convolutional layer and a global average pooling layer; the output of the first convolutional layer is added to the output of the third convolutional layer to form the final feature of the third layer.
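A minimal sketch of the second model, assuming tensorflow.keras; filter counts and kernel sizes are assumptions, since the text fixes only the layer order (three convolutional layers with a skip connection from the first to the third, a fourth convolutional layer, then global average pooling) and the 1000-dimensional output.

```python
# Sketch of model two (assumed: tensorflow.keras).
import tensorflow as tf
from tensorflow.keras import layers

def build_model_two(n_mels=80, out_dim=1000):
    inp = layers.Input(shape=(n_mels, None, 1))          # log-Mel spectrogram
    c1 = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(c1)
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(c2)
    skip = layers.Add()([c1, c3])                        # conv1 output + conv3 output
    c4 = layers.Conv2D(out_dim, 3, padding="same", activation="relu")(skip)
    return tf.keras.Model(inp, layers.GlobalAveragePooling2D()(c4))  # (batch, 1000)
```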
Step (5): the first model comprises one bidirectional gated recurrent unit layer, one unidirectional gated recurrent unit layer, and an attention layer, and finally extracts a 1000-dimensional high-level feature; the first three layers of the second model are convolutional layers with a skip connection, followed by one convolutional layer and a global average pooling layer, and it likewise extracts a 1000-dimensional high-level feature. The output features of the two models are concatenated and fused, and the result is finally classified.
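A minimal sketch of the fusion step, assuming tensorflow.keras and the two builders sketched above; the number of device classes is an assumption.

```python
# Sketch of step (5): hidden-unit splicing of the two 1000-dim outputs,
# followed by a softmax classifier (assumed: tensorflow.keras).
import tensorflow as tf
from tensorflow.keras import layers

def build_fused_model(model_one, model_two, n_classes=10):
    fused = layers.Concatenate()([model_one.output, model_two.output])  # (batch, 2000)
    out = layers.Dense(n_classes, activation="softmax")(fused)
    return tf.keras.Model([model_one.input, model_two.input], out)
```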
TABLE 2 Comparison of recognition rates of different models

| Model | Support vector machine | Recurrent neural network | Standard convolutional neural network | Model fusion |
| --- | --- | --- | --- | --- |
| Average recognition rate | 81% | 82.3% | 81.5% | 87.5% |
In conclusion, the improved recording equipment identification algorithm achieves an accuracy of 87.5%. Its characteristics are: 1) the model fusion structure improves the robustness of the system; 2) the extraction of frame-level features effectively mines recording equipment information in the audio; 3) an attention mechanism assigns different importance to different parts of the same sample and automatically learns the characteristics of the time-sequence signal; 4) low-level features are retained through the skip connection operation. Therefore, in practical application, different recording devices, such as the mobile phones and computers most widely used on the current market, can be effectively distinguished from the detected audio signals. The invention overcomes the low recognition rate of traditional recording equipment recognition models, improves the recognition effect and the robustness of the model, and has good application prospects.
The above description covers only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications and adaptations are also intended to fall within the scope of the invention.