Disclosure of Invention
The invention provides an automatic identification method and an automatic identification system for a multi-mode remote controller device, which address the problem that such devices identify their usage scenario with low accuracy.
In order to solve the above technical problems, the present invention provides an automatic identification method for a multi-mode remote controller device, including:
acquiring user gesture data, environmental parameters, voice data and key data;
performing numerical vectorization on the user gesture data, the environmental parameters, the voice data and the key data to obtain initial user features;
standardizing the initial user features and performing principal component analysis to obtain dimension-reduced user features;
performing weighted fusion of the dimension-reduced user features with a preset modal contribution matrix to obtain a fusion feature vector;
inputting the fusion feature vector into a pre-trained mode prediction model and outputting a predicted standby mode;
and performing a confidence judgment based on the predicted standby mode and a pre-acquired current use mode, and generating a mode switching instruction.
In an optional implementation, performing numerical vectorization on the user gesture data, the environmental parameters, the voice data and the key data to obtain the initial user features includes:
extracting an acceleration mean, an acceleration standard deviation and an acceleration peak from the user gesture data, and constructing a gesture feature vector;
performing positioning according to the environmental parameters to obtain current relative coordinates;
matching a pre-stored usage scene map against the current relative coordinates to obtain a current region code;
performing one-hot encoding on the current region code to obtain a region feature vector;
extracting voice features from the voice data to obtain a voice feature vector;
performing binary encoding on the key data to obtain a key code sequence;
and performing vector fusion of the gesture feature vector, the region feature vector, the voice feature vector and the key code sequence to obtain the initial user features.
In an optional implementation, standardizing the initial user features and performing principal component analysis to obtain the dimension-reduced user features includes:
normalizing the initial user features to obtain standardized initial features;
performing covariance calculation on the standardized initial features to obtain a covariance matrix;
extracting eigenvalues and corresponding eigenvectors from the covariance matrix;
arranging the eigenvalues in descending order, and constructing a projection matrix from a preset number of eigenvectors selected from largest to smallest;
and performing data projection with the projection matrix and the standardized initial features to obtain the dimension-reduced user features.
In an optional implementation, performing weighted fusion of the dimension-reduced user features with a preset modal contribution matrix to obtain a fusion feature vector includes:
multiplying the dimension-reduced user features by the preset modal contribution matrix to obtain an initial fusion vector;
and normalizing the initial fusion vector to obtain the fusion feature vector.
In an optional implementation, the training process of the mode prediction model includes:
training the mode prediction model on top of a pre-trained multi-modal model, with historical feature vectors as the model input data and the communication mode labels corresponding to those historical feature vectors as the output data;
iterating the model parameters by gradient descent;
and obtaining the trained model when the loss function of the model is detected to meet a preset training index, or the number of training iterations reaches a preset upper limit.
In an optional implementation, performing the confidence judgment based on the predicted standby mode and the pre-acquired current use mode, and generating the mode switching instruction, includes:
acquiring performance data of the current use mode;
performing a performance weight calculation on the performance data to obtain the performance weight of the current mode;
calculating the mode confidence by the following formula:

C = α · P_predict + γ · W_current + β · Δ_env

where C represents the mode confidence, P_predict represents the prediction probability of the predicted standby mode, W_current represents the performance weight of the current mode, Δ_env represents the environmental change coefficient, and α, γ and β represent preset coefficients;
triggering the mode switching instruction when the mode confidence is greater than a preset confidence threshold, and taking the predicted standby mode as the use mode at the next moment;
and continuing with the current use mode as the use mode at the next moment when the mode confidence is less than or equal to the confidence threshold.
In an optional implementation, performing the performance weight calculation on the performance data to obtain the performance weight of the current mode includes:
calculating the performance weight by the following formula:
where W_current represents the performance weight of the current mode, S represents the signal strength, S_th represents the signal strength threshold, BER represents the bit error rate, and D_norm represents the normalized delay coefficient.
In a second aspect, the present invention provides an automatic identification system for a multi-mode remote control device, comprising:
a data acquisition module, configured to acquire user gesture data, environmental parameters, voice data and key data;
an initial feature module, configured to perform numerical vectorization on the user gesture data, the environmental parameters, the voice data and the key data to obtain initial user features;
a data dimension reduction module, configured to standardize the initial user features and perform principal component analysis to obtain dimension-reduced user features;
a modal fusion module, configured to perform weighted fusion of the dimension-reduced user features with a preset modal contribution matrix to obtain a fusion feature vector;
a mode prediction module, configured to input the fusion feature vector into a pre-trained mode prediction model and output a predicted standby mode;
and a mode switching module, configured to perform a confidence judgment based on the predicted standby mode and a pre-acquired current use mode, and generate a mode switching instruction.
In a third aspect, the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for automatically identifying a multi-mode remote controller device according to any one of the above.
In a fourth aspect, the present invention further provides a computer readable storage medium storing a computer program, wherein, when the computer program runs, the device on which the computer readable storage medium is located is controlled to execute the method for automatically identifying a multi-mode remote controller device according to any one of the above.
Compared with the prior art, the invention has the following beneficial effects:
(1) The process of acquiring user gesture data, environmental parameters, voice data, and key data ensures comprehensive capture of user behavior and environmental status. The operation habit and the environmental information of the user are acquired in real time through various sensors, so that rich original data is provided for subsequent data processing. The process not only improves the accuracy and the integrity of the data, but also lays a foundation for personalized service, and is beneficial to improving the user experience.
(2) Numerical vectorization is performed on the user gesture data, environmental parameters, voice data and key data to obtain the initial user features. Signal processing techniques convert the different data types into a unified numerical vector form that is convenient for subsequent computation. This vectorization effectively integrates multi-source heterogeneous data, so that different types of input information can be analyzed within the same framework, significantly improving the efficiency and compatibility of data processing.
(3) The initial user features are standardized and principal component analysis is performed to obtain the dimension-reduced user features. Standardization eliminates scale differences between features, while principal component analysis (PCA) reduces the data dimension and retains the most representative feature components. This both reduces computational complexity and removes redundant information, improving the effectiveness of feature extraction and enhancing the generalization capability of the model.
(4) Weighted fusion is performed on the dimension-reduced user features with a preset modal contribution matrix to obtain a fusion feature vector. Combining the dimension-reduced features with the preset contribution weights of each mode produces a comprehensive feature vector. This fusion strategy accounts for the importance of each mode and its influence on the final decision, so that the fusion feature vector reflects the user's actual intent and improves the accuracy of the prediction result.
(5) The fusion feature vector is input into a pre-trained mode prediction model, which outputs a predicted standby mode. A mode prediction model trained with deep learning or machine learning algorithms analyzes the fusion feature vector to predict the standby mode best suited to the current situation, raising the intelligence of mode switching and the accuracy of scene recognition.
(6) A confidence judgment is made based on the predicted standby mode and the pre-acquired current use mode, and a mode switching instruction is generated. The system evaluates confidence based on the performance of the currently used mode and the prediction probability of the standby mode, and triggers a mode switching instruction when the required condition is met. This ensures that mode switching is reasonable and timely, and improves the accuracy of scene recognition and mode switching.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to fig. 1, a first embodiment of the present invention provides an automatic recognition method of a multi-mode remote controller device, comprising the steps of:
S11, acquiring user gesture data, environmental parameters, voice data and key data;
S12, performing numerical vectorization on the user gesture data, the environmental parameters, the voice data and the key data to obtain initial user features;
S13, standardizing the initial user features and performing principal component analysis to obtain dimension-reduced user features;
S14, performing weighted fusion of the dimension-reduced user features with a preset modal contribution matrix to obtain a fusion feature vector;
S15, inputting the fusion feature vector into a pre-trained mode prediction model and outputting a predicted standby mode;
S16, performing a confidence judgment based on the predicted standby mode and a pre-acquired current use mode, and generating a mode switching instruction.
In step S11, user gesture data, environmental parameters, voice data, and key data are acquired.
In one embodiment, user gesture data are collected in real time by a triaxial accelerometer and a gyroscope. The sensor sampling frequency is set to 100 Hz, acceleration (unit: m/s²) and angular velocity (unit: rad/s) are recorded, and the data are low-pass filtered (cut-off frequency 5 Hz) to remove noise, then stored in binary format in the device's local cache with a timestamp (milliseconds) and a user unique identifier (UUID) attached. The environmental parameters are obtained from an integrated temperature and humidity sensor (temperature accuracy ±0.5 °C, humidity accuracy ±3% RH), an illumination sensor (unit: lux) and a Bluetooth beacon positioning module; triangulation achieves a positioning accuracy of ±10 cm, and the data are stored on an edge computing device in JSON format, updated once per second. Voice data are collected by a MEMS microphone at a 16 kHz sampling rate with 16-bit quantization, framed (25 ms frame length, 10 ms frame shift) and compressed into the Opus format; the local buffer retains only the most recent 30 seconds, and transmission to the cloud uses AES-256 encryption. Key data are recorded by a capacitive touch module: the key press time (unit: ms), duration and key sequence are stored in binary form in the device's non-volatile memory (NVM) and synchronized to a cloud database every minute. All data storage follows privacy protection principles: raw voice and gesture data are processed only locally, and the historical data retention period does not exceed 7 days, to meet GDPR requirements.
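As a non-limiting illustration, the low-pass filtering of the accelerometer stream described above could be sketched as follows; the 4th-order Butterworth filter and the library choice are illustrative assumptions, not requirements of the embodiment:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100.0     # sensor sampling frequency (Hz), as in this embodiment
CUTOFF = 5.0   # low-pass cut-off frequency (Hz), as in this embodiment

def denoise_acceleration(raw: np.ndarray) -> np.ndarray:
    """Remove high-frequency noise from an (N, 3) accelerometer trace (m/s^2).

    A 4th-order Butterworth filter applied forward and backward (zero phase
    shift); the filter order is an illustrative choice.
    """
    b, a = butter(4, CUTOFF, btype="low", fs=FS)
    return filtfilt(b, a, raw, axis=0)

# Example: 2 s of synthetic 3-axis data at 100 Hz
raw = np.random.randn(200, 3)
smooth = denoise_acceleration(raw)
```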
It is worth noting that in a smart home environment, in order to optimize the selection of communication modes (such as infrared, Bluetooth and star flash), the system needs to collect several types of data. User gesture data can be captured by a camera or a wearable device, for example the acceleration and direction changes produced when a user waves a hand to control the lights. Environmental parameters include Wi-Fi signal strength, temperature, humidity and the like, monitored in real time through a sensor network; for example, star flash is used in open spaces for its wide coverage and strong anti-interference capability. Voice data are acquired by a microphone array and used to recognize user instructions; for example, in a quiet room a voice command is transmitted over Bluetooth to make a smart speaker play music, an interaction suited to Bluetooth's low-latency characteristics. Key data come from physical buttons or touch-screen input and suit precise operations in specific scenarios; for example, an infrared remote controller adjusts the television volume, exploiting infrared's high directivity. In one concrete usage scenario, as the user moves between different areas of the home, the system dynamically evaluates and switches to the communication mode best suited to the current needs based on the user's gestures (such as waving to change a song), the current environment (such as selecting star flash to keep a stable connection in an area with high Wi-Fi interference), voice commands (such as querying a recipe by voice over Bluetooth in the kitchen) and key operations (such as turning off living-room devices with the infrared remote controller), ensuring a smooth and reliable user experience. Fusing and analyzing these multimodal data lets the smart home system adapt more intelligently to the user's daily behavior and preferences.
In step S12, numerical vectorization is performed according to the user gesture data, the environmental parameter, the voice data, and the key data, so as to obtain initial characteristics of the user.
In one embodiment, an acceleration mean value, an acceleration standard deviation and an acceleration peak value are extracted according to the gesture data of the user, and a gesture feature vector is constructed;
Positioning according to the environmental parameters to obtain current relative coordinates;
matching to obtain a current region code according to a pre-stored usage scene map and the current relative coordinates;
performing single-hot coding on the current region code to obtain a region feature vector;
extracting voice characteristics according to the voice data to obtain voice characteristic vectors;
performing binarization coding according to the key data to obtain a key coding sequence;
And carrying out vector fusion according to the gesture feature vector, the region feature vector, the voice feature vector and the key coding sequence to obtain initial features of a user.
It is worth noting that when the acceleration mean, standard deviation and peak are extracted, the raw acceleration data are first low-pass filtered to remove high-frequency noise. The acceleration peak is defined as the maximum absolute value, to eliminate directional influence; the standard deviation is computed with the sample standard deviation formula to ensure statistical validity; and each statistic is Z-score standardized before the feature vector is constructed, to eliminate differences in scale.
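A minimal sketch of these gesture statistics follows; the Z-score parameters mu/sigma stand in for the per-statistic training values the embodiment stores in advance, and the single-axis input is an illustrative simplification:

```python
import numpy as np

def gesture_features(acc: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """acc: a filtered single-axis (or magnitude) acceleration trace.

    mu/sigma: per-statistic Z-score parameters computed in advance from
    historical training data, as the embodiment specifies.
    """
    feats = np.array([
        acc.mean(),          # acceleration mean
        acc.std(ddof=1),     # sample standard deviation, as specified
        np.abs(acc).max(),   # peak = maximum absolute value, direction-free
    ])
    return (feats - mu) / np.maximum(sigma, 1e-8)  # zero-std guard, as in S13

acc = np.random.randn(200)  # stand-in filtered trace
print(gesture_features(acc, mu=np.zeros(3), sigma=np.ones(3)))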
In one embodiment, the positioning parameters are obtained via Bluetooth beacons. For example, a device receives the signal strengths (RSSI) of several Bluetooth beacons (preset at different locations in a room) and calculates its two-dimensional coordinates relative to the beacon coordinate system using a triangulation algorithm. Assuming that beacon A is located at (0, 0), beacon B at (3, 0) and beacon C at (0, 3), the device determines its current coordinates (e.g., (1.2, 1.8)) from the respective beacon RSSI values by trilateration. Next, the calculated coordinates (e.g., X = 1.2, Y = 1.8) are matched against a pre-stored usage scene map. The map is a digital plan view dividing the space into several regions (such as "living room", "bedroom", "kitchen"), each with a unique code (such as living room = 1, bedroom = 2, kitchen = 3). Matching means looking up the region that contains the coordinates; for example, if the map defines the "living room" as X ∈ [0, 2] and Y ∈ [0, 2], the current coordinates (1.2, 1.8) fall in the living room and the region code is 1. Finally, the region code (e.g., 1) is converted into a one-hot encoded vector.
It is noted that the one-hot encoding process first determines the total number of region categories (e.g., region codes 1 to 5 for a total of 5 regions), each region corresponding to an independent binary dimension. The position corresponding to the target region code is then set to 1 (for example, with region code 1 the first element is set to 1, giving [1, 0, 0, 0, 0]). Each region code activates only its uniquely corresponding position in the vector while all other positions remain 0, forming independent codes that do not interfere with one another. For example, if the region code is 3 and there are 4 regions in total, the encoding result is [0, 0, 1, 0]. This eliminates the potential for numerical order to mislead the model and lets the machine learning algorithm accurately identify discrete region categories.
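For illustration, the region one-hot encoding could be realized as below (region counts follow the example values above):

```python
import numpy as np

def one_hot_region(region_code: int, num_regions: int) -> np.ndarray:
    """One-hot encode a 1-based region code into a num_regions-dim vector."""
    vec = np.zeros(num_regions)
    vec[region_code - 1] = 1.0
    return vec

print(one_hot_region(3, 4))  # -> [0. 0. 1. 0.]
print(one_hot_region(1, 5))  # -> [1. 0. 0. 0. 0.]
```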
It should be noted that the beacon RSSI value is converted to a distance by the log-distance path loss model:

d = d_0 · 10^((RSSI_0 − RSSI) / (10 · n))

where d represents the actual distance, d_0 represents the reference distance (100 cm), RSSI_0 represents the RSSI value (40 dBm) corresponding to the reference distance, RSSI represents the actual received signal strength, and n represents the path loss index (2).
It is worth noting that the triangulation algorithm treats the two-dimensional coordinates as unknowns, sets up a system of equations from the known beacon coordinates and the estimated distances, and solves for the unknowns.
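A compact sketch of this positioning step, assuming the log-distance model above and a least-squares linearization of the circle equations (one common trilateration approach; the beacon layout matches the example, and a negative reference RSSI is used since real readings are typically negative dBm):

```python
import numpy as np

def rssi_to_distance(rssi, rssi0=-40.0, d0=1.0, n=2.0):
    """Log-distance path loss model; d0 in metres (the embodiment's 100 cm).
    Example: rssi_to_distance(-60.0) -> 10.0 m."""
    return d0 * 10 ** ((rssi0 - rssi) / (10 * n))

def trilaterate(beacons, dists):
    """Solve the 2-D position from >= 3 beacon positions and distances by
    subtracting circle equations to get a linear least-squares system."""
    (x1, y1), d1 = beacons[0], dists[0]
    A, b = [], []
    for (xi, yi), di in zip(beacons[1:], dists[1:]):
        A.append([2 * (xi - x1), 2 * (yi - y1)])
        b.append(d1**2 - di**2 + xi**2 - x1**2 + yi**2 - y1**2)
    pos, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return pos  # (x, y) relative to the beacon coordinate system

beacons = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]  # A, B, C from the example
dists = [np.hypot(1.2, 1.8), np.hypot(1.8, 1.8), np.hypot(1.2, 1.2)]
print(trilaterate(beacons, dists))              # ~ [1.2, 1.8]
```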
It is worth noting that the collected voice data are first preprocessed: low-frequency noise is removed by high-pass filtering (cut-off frequency 100 Hz), and high-frequency information is then enhanced by pre-emphasis (coefficient 0.97). The speech signal is framed (frame length 25 ms, frame shift 10 ms, corresponding to 400 and 160 sample points), and each frame is multiplied by a Hamming window to eliminate edge effects. A fast Fourier transform (FFT, 512 points) of each frame yields its spectrum; the energy spectrum is computed and weighted-summed through a Mel filter bank (e.g., 20 to 40 triangular filters covering 20 Hz to 4 kHz) to obtain Mel-frequency energy values. After taking the logarithm of each filter-bank energy, a discrete cosine transform (DCT) is applied and the first 13 coefficients are extracted as MFCCs (Mel-frequency cepstral coefficients), forming a 13-dimensional voice feature vector for subsequent processing.
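This MFCC pipeline corresponds closely to what standard audio libraries provide; a sketch using librosa follows. The library choice, the 26-filter bank, the synthetic input and the averaging over frames into one utterance-level vector are all illustrative assumptions:

```python
import numpy as np
import librosa

SR = 16000  # 16 kHz sampling rate, as in this embodiment

def extract_mfcc(signal: np.ndarray) -> np.ndarray:
    """13 MFCCs per frame: 25 ms windows (400 samples), 10 ms hop (160)."""
    # Pre-emphasis with coefficient 0.97, as described above
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    mfcc = librosa.feature.mfcc(
        y=emphasized, sr=SR, n_mfcc=13,
        n_fft=512, win_length=400, hop_length=160,
        window="hamming", n_mels=26, fmin=20, fmax=4000,
    )
    return mfcc.mean(axis=1)  # frame aggregation choice is illustrative

speech = np.random.randn(SR).astype(np.float32)  # 1 s of stand-in audio
print(extract_mfcc(speech).shape)                # (13,)
```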
It should be noted that for key data, key actions are first converted into binary codes. Assume the device has 4 physical keys (e.g., up, down, left and right), with each key's pressed state represented by 0 (not pressed) or 1 (pressed). For example, if a "left key press" is currently detected and the other keys are not activated, the code is [0, 0, 1, 0] (in the order up, down, left, right). A time window is then applied: for example, if the current time is 15:00:00, key records between 14:59:58 and 15:00:00 are counted. For each key ("up", "down", "left", "right") it is determined whether it was pressed; the corresponding position is marked 1 as long as the key was triggered at least once within the window, and 0 otherwise. For example, if the user presses the "left" key and then the "up" key within the window, the generated binary vector is [up = 1, down = 0, left = 1, right = 0] (i.e., [1, 0, 1, 0]); pressing the same key several times (e.g., the left key 3 times) is still recorded as a single 1. The final output is always a fixed-dimension vector (e.g., 4-dimensional) matching the total number of keys, and is all zeros if no key was activated. This time-window aggregation converts a continuous sequence of short-term key operations into a stable feature vector, e.g. the combination "left → up" is encoded as [1, 0, 1, 0], which facilitates subsequent analysis.
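A small sketch of this time-window key aggregation (the 2-second window and key order follow the example above):

```python
KEY_ORDER = ["up", "down", "left", "right"]

def encode_keys(events, now, window_s=2.0):
    """Binarize key activity over the last `window_s` seconds.

    events: list of (timestamp_seconds, key_name); multiple presses of the
    same key inside the window still yield a single 1.
    """
    active = {k for t, k in events if now - window_s <= t <= now}
    return [1 if k in active else 0 for k in KEY_ORDER]

events = [(0.5, "left"), (1.0, "left"), (1.4, "up")]
print(encode_keys(events, now=2.0))  # [1, 0, 1, 0]
```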
It should be noted that all dimensional features are concatenated, in the order gesture feature vector, region feature vector, voice feature vector, key code sequence, into a high-dimensional initial user feature vector, as sketched below. Each sub-feature must first be standardized (e.g., Z-score) to eliminate scale differences; simple concatenation then integrates them into a unified representation. The resulting vector synthesizes the multimodal information of the user's gestures, position, voice and key operations, and provides the input for subsequent dimension reduction and mode prediction.
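The splice itself is a plain concatenation of the standardized sub-vectors (a trivial sketch, with the fixed ordering described above):

```python
import numpy as np

def initial_user_features(gesture, region, speech, keys):
    """Concatenate standardized sub-vectors: gesture -> region -> speech -> keys."""
    return np.concatenate([gesture, region, speech, np.asarray(keys, float)])
```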
In step S13, standardization is performed on the initial user features, and principal component analysis is performed, to obtain the dimension-reduced user features.
In one embodiment, the initial user features are normalized to obtain standardized initial features;
covariance calculation is performed on the standardized initial features to obtain a covariance matrix;
eigenvalues and corresponding eigenvectors are extracted from the covariance matrix;
the eigenvalues are arranged in descending order, and a projection matrix is constructed from a preset number of eigenvectors selected from largest to smallest;
and data projection with the projection matrix and the standardized initial features yields the dimension-reduced user features.
It is worth noting that the standardization uses the Z-score method. Specifically, for each dimension of the initial user features, the mean (μ) and standard deviation (σ) of that dimension are computed from the training data, and an original feature value x is transformed into (x − μ)/σ, so that the standardized features have mean 0 and standard deviation 1. μ and σ are computed and stored in advance from historical training data and applied directly at deployment to keep the data scale consistent. If the standard deviation of some dimension is zero (the feature is constant), a minimum threshold (e.g., 1×10⁻⁸) is substituted to avoid division-by-zero errors while preserving the original data distribution.
It should be noted that computing the eigenvalues and eigenvectors of the covariance matrix determines the principal component directions of the data: each eigenvector points in a direction of maximal data variation, and the size of the corresponding eigenvalue gives the variance contribution in that direction. After the eigenvalues are arranged in descending order, the cumulative variance contribution rate of the first k eigenvalues (defined as the ratio of the sum of the first k eigenvalues to the total sum) is computed, and the curve of cumulative variance contribution rate against k is plotted (a scree plot). The elbow inflection point of this curve, i.e. the position where the variance gain slows markedly as k grows, gives the k value used as the preset dimension-reduction target dimension.
It should be noted that each column of the projection matrix is an eigenvector, and its number of rows equals the number of original features. In the matrix multiplication, the standardized data (samples × features) are multiplied by the projection matrix (features × number of principal components), giving dimension-reduced data of shape samples × number of principal components. The data projection step is therefore just this matrix multiplication of the standardized initial features with the projection matrix.
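Step S13 thus reduces, under the conventions above, to a few lines of linear algebra. A numpy sketch (the data and k = 8 are illustrative; in practice k would come from the scree plot just described):

```python
import numpy as np

def standardize(X, mu, sigma):
    """Z-score with stored training statistics; zero stds floored at 1e-8."""
    return (X - mu) / np.maximum(sigma, 1e-8)

def pca_project(X_std, k):
    """Project standardized data (samples x features) onto the top-k
    principal components of its covariance matrix."""
    cov = np.cov(X_std, rowvar=False)       # features x features
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # descending eigenvalue order
    W = eigvecs[:, order[:k]]               # projection matrix (features x k)
    explained = eigvals[order[:k]].sum() / eigvals.sum()
    return X_std @ W, W, explained

X = np.random.randn(200, 24)                # 200 stand-in historical samples
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z, W, ratio = pca_project(standardize(X, mu, sigma), k=8)
print(Z.shape, round(ratio, 3))             # (200, 8) and the cumulative CVR
```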
In step S14, the weighted fusion is performed according to the dimension-reduced user feature and a preset modal contribution matrix, so as to obtain a fusion feature vector.
In one embodiment, the dimension-reduced user features are multiplied by a preset modal contribution matrix to obtain an initial fusion vector;
and the initial fusion vector is normalized to obtain the fusion feature vector.
In one embodiment, the modal contribution matrix is set from historical data: rows correspond to the communication modes (such as infrared, Bluetooth and star flash), columns correspond to the sensor feature types (gesture, environment, voice/key), and each cell value is the contribution weight of that feature to that mode. For example, gesture motion may have weight 0.4 for infrared, 0.2 for Bluetooth and 0.1 for star flash; the environmental parameters (e.g., distance, signal strength) weights 0.3, 0.5 and 0.6 respectively; and voice/key may affect only Bluetooth (0.3) and star flash (0.4), with no correlation to infrared (weight 0). During calculation, the standardized value of each sensor feature (e.g., gesture acceleration standardized to 0.8, environmental parameter to 0.5) is multiplied by the corresponding weight and summed to give the total contribution of each mode (e.g., infrared: 0.8 × 0.4 + 0.5 × 0.3 = 0.47); the mode with the highest contribution is then selected as the communication mode, realizing multimodal decision-making based on dynamic weights.
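The weighted fusion itself is a matrix product followed by normalization. A sketch using the example weights quoted above; the three per-group standardized values stand in for the dimension-reduced features, and the L2 normalization is an illustrative choice, since the embodiment does not fix a normalization scheme:

```python
import numpy as np

# Rows: communication modes (infrared, bluetooth, star flash);
# columns: feature groups (gesture, environment, voice/key).
CONTRIBUTION = np.array([
    [0.4, 0.3, 0.0],   # infrared
    [0.2, 0.5, 0.3],   # bluetooth
    [0.1, 0.6, 0.4],   # star flash
])

def fuse(features):
    """features: standardized per-group values [gesture, env, voice/key]."""
    raw = CONTRIBUTION @ features      # per-mode total contributions
    return raw / np.linalg.norm(raw)   # normalized fusion feature vector

# Gesture 0.8, environment 0.5, no voice/key -> infrared = 0.47 before norm
print(fuse(np.array([0.8, 0.5, 0.0])))
```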
It should be noted that the modal contribution matrix is a weight table quantifying the importance of the different interaction modes (such as gesture, environmental awareness, voice and keys) in the current scenario; its core function is to dynamically adjust each mode's decision weight according to historical data and the real-time environment. The matrix is generated from two kinds of data: performance records of historical communication modes (such as each mode's recognition accuracy, response speed and anti-interference capability in different scenarios), and the influence of the current environmental parameters (such as illumination, noise and user position) on mode effectiveness. For example, if the historical data show that speech recognition success drops in a noisy environment, the matrix lowers the weight of the voice mode; conversely, if the user is in an area where gestures can be performed clearly, the weight of the gesture mode is raised. Each value in the matrix represents the contribution proportion of the corresponding mode under the current conditions, so that when the multimodal features are fused, the most reliable and effective mode information for the current scenario is considered first, and the weighted fusion finally produces a more accurate feature vector.
In step S15, the fusion feature vector is input into a pre-trained mode prediction model, and a predicted standby mode is output.
In one embodiment, the training process of the mode prediction model includes:
training the mode prediction model on top of a pre-trained multi-modal model, with historical feature vectors as the model input data and the communication mode labels corresponding to those historical feature vectors as the output data; iterating the model parameters by gradient descent; and obtaining the trained model when the loss function of the model is detected to meet a preset training index, or the number of training iterations reaches a preset upper limit.
It should be noted that a training set is first constructed from historical interaction data: the inputs are the standardized, fused feature vectors of historical scenarios (containing multimodal information such as gesture, environment and voice/key), and the outputs are the communication mode labels actually used in the corresponding scenarios (such as infrared, Bluetooth and star flash). A pre-trained multi-modal model (such as a Transformer-based multi-task learning framework) is adopted as the base architecture, with the output layer dimension adjusted to the task (e.g., 3 mode classes). During training, batches of input feature vectors are fed into the model; forward propagation predicts the mode probability distribution, the cross-entropy loss between the prediction and the actual label is computed, and the Adam optimizer performs gradient descent at a preset learning rate (e.g., 0.001) to iteratively update the model parameters. The validation loss is monitored continuously; if the loss decrease over 5 consecutive epochs is below 0.001 (the preset convergence threshold), or the number of training rounds reaches the upper limit of 1000, training stops and the optimized parameters are output, completing construction of the mode prediction model. The process combines the multimodal representation capability of the pre-trained model with the supervision signal of the historical data, and through iterative optimization the model learns to extract the decision basis for mode selection from the fusion features.
It should be noted, assuming the current use mode is infrared, that in one scenario the user performs a "quick wave" gesture (X-axis peak of the acceleration sensor 12 m/s², standardized to 0.8) in an open area (Wi-Fi signal strength −45 dBm, standardized to 0.6), environmental interference is low (Bluetooth interference index 0.1, standardized to −0.5), and no voice command is triggered (voice activation state 0, still 0 after standardization). After this input is fed into the pre-trained mode prediction model, the model's output Softmax probability distribution is: Bluetooth mode 0.65, star flash mode 0.25, infrared mode 0.1. By rule, the standby mode is the non-current mode with the highest probability (i.e., Bluetooth), so the predicted standby mode is Bluetooth with confidence 0.65; the total confidence is then computed together with the real-time performance weight and the environmental change coefficient to decide whether to switch.
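A reduced sketch of the training loop described above, in PyTorch. The small nn.Sequential head, the synthetic data and the exact stopping logic are illustrative stand-ins; the embodiment fine-tunes a pre-trained multi-modal (Transformer-based) model rather than training from scratch:

```python
import torch
import torch.nn as nn

MODES = ["infrared", "bluetooth", "star_flash"]

model = nn.Sequential(                    # stand-in for the fine-tuned head
    nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, len(MODES)),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr = 0.001, as above
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(256, 8)                   # historical fusion feature vectors
y = torch.randint(0, len(MODES), (256,))  # corresponding mode labels

prev, patience = float("inf"), 0
for epoch in range(1000):                 # upper limit of 1000 rounds
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    # stop when the loss improvement stays below 0.001 for 5 epochs
    patience = patience + 1 if prev - loss.item() < 1e-3 else 0
    prev = loss.item()
    if patience >= 5:
        break

probs = torch.softmax(model(torch.randn(1, 8)), dim=1)  # mode distribution
print(MODES[int(probs.argmax())], probs)
```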
In step S16, confidence judgment is performed according to the predicted standby mode and the current use mode obtained in advance, so as to generate a mode switching instruction.
In one embodiment, performance data of the current use mode are obtained;
a performance weight calculation is performed on the performance data to obtain the performance weight of the current mode;
the mode confidence is calculated by the following formula:

C = α · P_predict + γ · W_current + β · Δ_env

where C represents the mode confidence, P_predict represents the prediction probability of the predicted standby mode, W_current represents the performance weight of the current mode, Δ_env represents the environmental change coefficient, and α, γ and β represent preset coefficients;
when the mode confidence is greater than a preset confidence threshold, the mode switching instruction is triggered, and the predicted standby mode is taken as the use mode at the next moment;
and when the mode confidence is less than or equal to the confidence threshold, the current use mode continues as the use mode at the next moment.
It is worth noting that, for example, α may take the value 0.3, γ the value 0.2 and β the value 0.3; the invention is not limited to these values, provided that α + γ + β ≤ 1 is satisfied.
It is worth noting that the mode confidence is computed from three kinds of indicators: (1) the prediction probability (the standby mode probability output by the model, e.g. star flash mode probability 0.5); (2) the current mode performance weight (combining indicators such as signal strength, bit error rate and delay, e.g. a current Bluetooth weight of 0.48); and (3) the environmental change coefficient (1 when the user moves or switches areas, otherwise 0). These are fused with the preset coefficients; if, for example, the computed mode confidence is 0.792 and exceeds the preset threshold (e.g., 0.7), mode switching is triggered (e.g., from Bluetooth to star flash); otherwise the current mode continues in use. The process combines model prediction, real-time performance evaluation and environmental mutation detection, dynamically balancing the accuracy and timeliness of switching.
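The switching decision reduces to the weighted sum and threshold comparison above; a sketch with the example coefficients (α = 0.3, γ = 0.2, β = 0.3, threshold 0.7 — all example values, not fixed by the invention):

```python
ALPHA, GAMMA, BETA = 0.3, 0.2, 0.3  # preset coefficients, sum <= 1
THRESHOLD = 0.7                     # preset confidence threshold

def next_mode(p_predict, w_current, delta_env, predicted, current):
    """Return the mode to use at the next moment.

    p_predict: prediction probability of the predicted standby mode
    w_current: performance weight of the current mode
    delta_env: 1 if the user moved / switched areas, else 0
    """
    confidence = ALPHA * p_predict + GAMMA * w_current + BETA * delta_env
    return predicted if confidence > THRESHOLD else current

print(next_mode(0.65, 0.48, 1, "bluetooth", "infrared"))  # C=0.591 -> keep infrared
print(next_mode(0.95, 0.90, 1, "bluetooth", "infrared"))  # C=0.765 -> switch
```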
In one embodiment, the performance weight is calculated by the following formula:
where W_current represents the performance weight of the current mode, S represents the signal strength, S_th represents the signal strength threshold, BER represents the bit error rate, and D_norm represents the normalized delay coefficient.
It is worth noting how the parameters required to calculate the current mode's performance weight are obtained: the signal strength (such as the Wi-Fi dBm value or the Bluetooth RSSI value) is read in real time from a hardware sensor or a system API; the signal strength threshold (such as −70 dBm for Wi-Fi or −60 dBm for Bluetooth) is loaded from a preset configuration file; the bit error rate is computed in real time as the ratio of erroneous packets to total packets in the communication protocol stack; and the normalized delay coefficient is obtained by measuring the end-to-end data transmission delay (millisecond time difference) and dividing by a preset maximum delay threshold (such as 100 ms for Wi-Fi or 50 ms for Bluetooth) to get a dimensionless value. All parameters are stored as floating-point numbers, ensuring real-time dynamic updating and computational compatibility.
It is worth noting that the final performance weight is the product of three dimensionless factors and comprehensively reflects the mode's signal quality, transmission reliability and delay efficiency; the larger the value, the better the mode's current performance.
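The filed formula is not reproduced above; a plausible reading consistent with the description (a product of a signal-quality factor from S and S_th, a reliability factor from BER, and a delay-efficiency factor from D_norm) would be the following sketch. The factor forms are an assumption inferred from the surrounding text, not a quotation of the filed formula, and the signal ratio is evaluated on the linear power scale since raw dBm readings are negative:

```python
def performance_weight(s_dbm, s_th_dbm, ber, d_norm):
    """Assumed form: product of three dimensionless factors (see note above).

    s_dbm:    measured signal strength, e.g. -55.0 (dBm)
    s_th_dbm: threshold from the configuration file, e.g. -70.0 (dBm)
    ber:      bit error rate in [0, 1]
    d_norm:   measured delay / preset maximum delay, clipped to [0, 1]
    """
    # S/S_th read on the linear power scale, capped at 1 (assumption)
    signal_quality = min(10 ** ((s_dbm - s_th_dbm) / 10.0), 1.0)
    reliability = 1.0 - ber
    delay_efficiency = 1.0 - min(max(d_norm, 0.0), 1.0)
    return signal_quality * reliability * delay_efficiency

print(performance_weight(-55.0, -70.0, ber=0.01, d_norm=0.3))  # -> 0.693
```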
In summary, the invention discloses an automatic identification method and system of multi-mode remote controller equipment, aiming at improving the identification accuracy of user behaviors and environmental states in an intelligent home environment. The method firstly involves acquiring gesture data, environment parameters, voice data and key data of a user, wherein the data are acquired in real time through various sensors, so that the operation habit and environment information of the user are comprehensively captured, and rich original data are provided for subsequent data processing.
Specifically, an acceleration mean, standard deviation and peak are extracted from the user gesture data collected by sensors such as accelerometers to construct a gesture feature vector; the environmental parameters are used for positioning and matched against a pre-stored usage scene map to obtain a current region code, which is one-hot encoded into a region feature vector; voice features are extracted from the voice data to generate a voice feature vector; and the key data are binary-encoded into a key code sequence. The gesture feature vector, region feature vector, voice feature vector and key code sequence are then merged into a high-dimensional initial user feature vector.
To reduce computational complexity and remove redundant information, the invention standardizes the features and then applies principal component analysis (PCA), extracting the most representative dimension-reduced user features from the initial user features. The dimension-reduced user features are then weight-fused with a preset modal contribution matrix to generate a fusion feature vector. This process fully accounts for the importance of each mode and its influence on the final decision, so that the fusion feature vector reflects the user's actual intent.
The fusion feature vector is then input into a pre-trained mode prediction model, which outputs a predicted standby mode. The model takes historical feature vectors as input data and optimizes its parameters by gradient descent, so that it can analyze the fusion feature vector and predict the communication mode best suited to the current situation. In addition, the invention designs a mechanism that judges confidence from the performance difference between the predicted standby mode and the currently used mode, to decide whether to generate a mode switching instruction. This includes obtaining performance data of the currently used mode, computing its performance weight, and calculating the mode confidence by the specific formula given above. When the mode confidence exceeds a preset threshold, a mode switching instruction is triggered; otherwise the current mode is maintained.
Referring to fig. 2, a second embodiment of the present invention provides an automatic recognition system of a multi-mode remote controller device, comprising:
a data acquisition module, configured to acquire user gesture data, environmental parameters, voice data and key data;
an initial feature module, configured to perform numerical vectorization on the user gesture data, the environmental parameters, the voice data and the key data to obtain initial user features;
a data dimension reduction module, configured to standardize the initial user features and perform principal component analysis to obtain dimension-reduced user features;
a modal fusion module, configured to perform weighted fusion of the dimension-reduced user features with a preset modal contribution matrix to obtain a fusion feature vector;
a mode prediction module, configured to input the fusion feature vector into a pre-trained mode prediction model and output a predicted standby mode;
and a mode switching module, configured to perform a confidence judgment based on the predicted standby mode and a pre-acquired current use mode, and generate a mode switching instruction.
It should be noted that the automatic identification system for a multi-mode remote controller device provided by this embodiment of the present invention executes all the process steps of the automatic identification method in the above embodiment; the working principles and beneficial effects of the two correspond one-to-one and are therefore not repeated here.
The embodiment of the invention also provides electronic equipment. The electronic device comprises a processor, a memory and a computer program, such as a data acquisition program, stored in the memory and executable on the processor. The processor, when executing the computer program, implements the steps in the above-described embodiments of the method for automatically identifying multi-mode remote control devices, for example, step S11 shown in fig. 1. Or the processor, when executing the computer program, performs the functions of the modules/units in the above-described device embodiments, such as a data acquisition module.
The computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present invention, for example. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments are used for describing the execution of the computer program in the electronic device.
The electronic equipment can be a desktop computer, a notebook computer, a palm computer, an intelligent tablet and other computing equipment. The electronic device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the above components are merely examples of electronic devices and are not limiting of electronic devices, and may include more or fewer components than those described above, or may combine certain components, or different components, e.g., the electronic devices may also include input-output devices, network access devices, buses, etc.
The processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or any conventional processor. The processor is the control center of the electronic device and connects the various parts of the whole electronic device using various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the electronic device (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The integrated modules/units of the electronic device, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the method of the above embodiment by instructing the relevant hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor it implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be adjusted appropriately according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that the apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over several network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the invention, the connection relationships between modules indicate communication connections between them, which may be implemented specifically as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the invention without creative effort.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.