Disclosure of Invention
Based on the foregoing, it is necessary to provide a man-machine interaction method, system, device and medium for solving the above technical problems.
The embodiment of the invention provides a man-machine interaction method, which comprises the following steps:
acquiring an action image when a user points to an interactive object;
identifying the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
extracting characteristic values of the face image in a three-dimensional coordinate system established by taking control equipment as an origin, and calibrating the horizontal angle and the vertical angle of the face image according to the characteristic values of the face image so as to determine the sight direction of a user;
extracting, from the hand image, the pointing stage features of starting, maintaining and ending, establishing a hidden Markov model HMM for each pointing stage feature, and carrying out optimal path selection on all the hidden Markov models HMM so as to determine the finger pointing of the user;
and taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
Optionally, the step of identifying the face of the user on the action image specifically includes:
In the RGB space, stimulus values of captured image colors are:
R = ∫φ(λ)·r(λ)dλ, G = ∫φ(λ)·g(λ)dλ, B = ∫φ(λ)·b(λ)dλ,
wherein φ(λ) represents the relative spectral power distribution of the color light perceived by the human eye, and r(λ), g(λ) and b(λ) are the CIE 1964 spectral tristimulus values;
according to the constancy of human eyes in color perception, the influence of brightness in skin color representation is removed, and the component value of skin color is obtained, wherein the calculation formula of the component value is as follows:
r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B),
wherein R, G and B are the stimulus values of the image colors, r is the red component, g is the green component, and b is the blue component;
normalizing the component values of the skin colors, converting the RGB space into an rg space, so that the skin colors follow a two-dimensional normal distribution in the rg space;
distinguishing the face of the user from the background image, and then carrying out grayscale processing to obtain a grayscale image;
and locating the feature points of the face in the grayscale image and extracting the face parameters to obtain the face image of the user.
Optionally, calibrating the horizontal angle and the vertical angle of the face image of the user according to the feature value of the face image specifically includes:
Dividing the facial image of the user in the vertical direction and the horizontal direction according to the number of nodes to obtain a standard coding set Y = [Y_1, Y_2, Y_3, Y_4, Y_5], wherein Y_1 represents the division result of the horizontal angle and the vertical angle of the facial image at a first node, Y_2 represents the division result of the horizontal angle and the vertical angle of the facial image at a second node, Y_3 represents the division result of the horizontal angle and the vertical angle of the facial image at a third node, Y_4 represents the division result of the horizontal angle and the vertical angle of the facial image at a fourth node, and Y_5 represents the division result of the horizontal angle and the vertical angle of the facial image at a fifth node;
Extracting characteristic values from the facial image, and carrying out normalization processing on the characteristic values to obtain an input vector set X = (x_1, x_2, …, x_m)^T;
the input vector is input into a Gaussian RBF neural network GRBF to obtain an output vector set y = [y_1, y_2, y_3, y_4, y_5], with the formula:
y_j = Σ_{i=1}^{h} w_ij · exp(−‖x_p − c_i‖² / (2b_j²)), j = 1, 2, …, n,
wherein j = 1, 2, 3, …, n, n is the output vector dimension, y_j is the output vector of the j-th output node of the Gaussian RBF neural network GRBF corresponding to the input vector, b_j is the basis function width of the network hidden layer node, x_p is the p-th input vector, p = 1, 2, 3, …, P, P is the total number of input vectors, i = 1, 2, 3, …, h, h is the number of hidden layer units, w_ij is the connection weight from the hidden layer to the output layer, c_i is the center of the network hidden layer node, and σ is the variance of the basis function;
the confidence coefficient beta is used for representing the matching degree between the output vector and the standard code, and the formula is as follows:
β = 1 − (1/n) · Σ_{j=1}^{n} |y_j − Y_j|,
wherein Y_j is an element in the standard coding set, and n and j are positive integers;
If the confidence coefficient beta is larger than or equal to the set threshold alpha, the division result corresponding to the standard code is the horizontal angle and the vertical angle of the face image of the user.
Optionally, performing optimal path selection on all the hidden Markov models HMM specifically includes:
Constructing an observation sequence O = O_1O_2O_3…O_t, with O_t = (ω_pan,t, ω_tilt,t), based on each pointing stage feature of the hand image, wherein ω_pan,t is the translational angular velocity of the pointing stage feature corresponding to the time point t, and ω_tilt,t is the vertical angular velocity of the pointing stage feature corresponding to the time point t;
Constructing a model θ = (A, B, π), wherein π is the initial state probability vector, A is the state transition probability matrix, and B is the observation probability matrix;
Based on the observation sequence O and the model theta, a hidden Markov model HMM is constructed, and the expression P (O|theta) is as follows:
P(O|θ) = π_{q_1} · b_{q_1}(O_1) · a_{q_1 q_2} · b_{q_2}(O_2) · … · a_{q_{t−1} q_t} · b_{q_t}(O_t),
wherein O is the observation sequence, O_1, O_2, …, O_t are the values in the observation sequence, Q is the best path, q_1, q_2, …, q_t are the values in the best path, b_{q_r}(O_r) are the values in the observation probability matrix, a_{q_{r−1} q_r} are the values in the state transition probability matrix, and r = 1, 2, 3, …, t;
expressing P(O|θ) by the parameters (A, B, π) of the model θ, letting ∂P(O|θ)/∂θ = 0, and taking the solution as the re-estimation formula for each parameter;
and maximizing P (O|theta) according to a re-estimation formula of each parameter, obtaining an optimal path, and determining the finger pointing angle of the user through the optimal path.
The embodiment of the invention also provides a man-machine interaction system, which comprises:
The image acquisition module is used for acquiring action images when a user points to the interactive object;
the image recognition module is used for recognizing the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
The face processing module is used for extracting the characteristic value of the face image in a three-dimensional coordinate system established by taking the control equipment as an origin, calibrating the horizontal angle and the vertical angle of the face image according to the characteristic value of the face image, and determining the sight direction of a user;
The hand processing module is used for extracting, from the hand image, the pointing stage features of starting, maintaining and ending, establishing a hidden Markov model HMM for each pointing stage feature, and carrying out optimal path selection on all the hidden Markov models HMM so as to determine the finger pointing of the user;
And the response module is used for taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
The embodiment of the invention also provides computer equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the man-machine interaction method when executing the computer program.
The embodiment of the invention also provides a storage medium, on which a computer program is stored, which when being executed by a processor, realizes the steps of the man-machine interaction method.
Compared with the prior art, the man-machine interaction method, system, equipment and medium provided by the embodiment of the invention have the following beneficial effects:
In the prior art, human-computer interaction relies on eye images to determine the point on the computer screen that the user is gazing at, and uses this point alone as the means of judging the interaction operation. This requires the user to stay close to the camera and limits how far the head may deflect; when the user's head deflects too much, misjudgment easily occurs, resulting in wrong interaction operations and failure to achieve the expected interaction effect.
In the embodiments of the invention, the face and the hand of the user are identified in the action image to obtain a face image and a hand image of the user respectively, the sight angle of the user is determined based on the face image, the pointing angle of the finger of the user is determined based on the hand image, and the intersection point of the extension lines of the sight direction and the finger pointing is taken as the user input focus. Determining the user input focus by combining the sight angle and the finger pointing angle avoids a single judging means, solves the problem in the prior art that misjudgment easily occurs when eye images alone serve as the judging means, and further improves the accuracy and recognition speed of man-machine interaction.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
1. Description of the principles
As shown in fig. 1, the center point of the face after face recognition is taken as the origin of coordinates; the x-axis and the z-axis, over which the projections of the line-of-sight direction and the finger pointing rotate through 360°, together with the y-axis in the vertical direction form a cylindrical coordinate system.
Δθ = θ_Head − θ_Hand,
Δφ = φ_Head − φ_Hand.
where θ_Head and θ_Hand represent the horizontal angles of the line of sight and the finger pointing respectively, φ_Head and φ_Hand represent the vertical angles of the line of sight and the finger pointing respectively, and Δθ and Δφ represent the differences between the horizontal angles and between the vertical angles.
As shown in fig. 2, it was found that both were substantially fixed values during the pointing-maintaining phase, that is, the position and angular relationship of the head and hand were fixed while the human body was pointing at an object. Therefore, the relationship between the two can be applied to a human-computer interaction system to quickly position the pointing focus of the user.
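As a minimal illustration of this relationship, the following Python sketch computes Δθ and Δφ for a short sequence of frames; the per-frame head and hand angles are hypothetical values, assumed to be already estimated in degrees.

```python
import numpy as np

def angle_offsets(theta_head, theta_hand, phi_head, phi_hand):
    """Compute the horizontal and vertical angle differences between
    the line of sight (head) and the finger pointing (hand)."""
    d_theta = np.asarray(theta_head) - np.asarray(theta_hand)
    d_phi = np.asarray(phi_head) - np.asarray(phi_hand)
    return d_theta, d_phi

# Hypothetical per-frame angles (degrees) during the pointing-maintaining phase.
theta_head = [31.0, 30.8, 31.2, 30.9]
theta_hand = [12.1, 11.9, 12.3, 12.0]
phi_head = [-8.0, -7.9, -8.1, -8.0]
phi_hand = [3.1, 3.0, 3.2, 3.1]

d_theta, d_phi = angle_offsets(theta_head, theta_hand, phi_head, phi_hand)
# Near-constant offsets indicate that the head-hand relationship is fixed while pointing.
print("delta theta:", d_theta, "std:", d_theta.std())
print("delta phi:  ", d_phi, "std:", d_phi.std())
```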
In one embodiment, a human-computer interaction method is provided, which includes:
1. the working principle is as follows, as shown in fig. 3 and 4:
(1) The user stands within a suitable distance from the interactive object (a screen, a robot, or another device). The control device acquires an action image of the user pointing at the interactive object, the action image covering the complete process of the user pointing at the interactive object. After the image is preprocessed, the face and the hand of the user in the action image are identified to obtain a face image and a hand image of the user respectively. The characteristic values of the face image are extracted in a three-dimensional coordinate system established with the control device as the origin, and the horizontal angle and the vertical angle of the face image are calibrated according to the characteristic values of the face image so as to determine the sight direction of the user. The pointing stage features of starting, maintaining and ending are extracted from the hand image, a hidden Markov model HMM is established for each pointing stage feature, and optimal path selection is carried out on all the hidden Markov models HMM so as to determine the finger pointing of the user. The intersection point of the sight direction of the user and the finger pointing is taken as the user input focus, and the interactive object is controlled to execute a response action according to the user input focus so as to realize man-machine interaction.
(2) A gaussian RBF neural network (Generalized Radial Basis Function, GRBF) that is more suitable for addressing highly nonlinear problems is used to calibrate the orientation of a person's face. And extracting relevant characteristic values from the processed face gray level images, inputting the characteristic values into a neural network which is learned and trained in advance, and calibrating the horizontal angle and the vertical angle of the face of the person through comparison and result evaluation so as to determine the sight line direction.
(3) According to the characteristics of finger pointing, a hidden Markov model (Hidden Markov Model, HMM) is adopted to determine the angle of the finger pointing. During interaction, the relevant pointing stage is separated out using the hidden Markov models HMM, and the finger pointing is determined.
(4) And when the sight line direction and the finger direction are determined, taking the intersection point of the sight line direction and the finger direction of the user as a user input focus, and controlling the interactive object to execute a response action according to the user input focus so as to realize man-machine interaction.
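A minimal Python sketch of this step is given below. Two rays in three-dimensional space rarely intersect exactly, so the sketch takes the midpoint of the shortest segment between the gaze ray and the pointing ray as a practical stand-in for the intersection point; the eye and hand positions and directions are hypothetical values in the control-device coordinate system.

```python
import numpy as np

def input_focus(p_eye, d_gaze, p_hand, d_point):
    """Approximate the intersection of the gaze ray and the finger-pointing ray
    by the midpoint of the shortest segment connecting the two rays."""
    d1 = d_gaze / np.linalg.norm(d_gaze)
    d2 = d_point / np.linalg.norm(d_point)
    w0 = p_eye - p_hand
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-9:            # nearly parallel rays: fall back to a projection
        s, t = 0.0, (e / c)
    else:
        s = (b * e - c * d) / denom
        t = (a * e - b * d) / denom
    closest_on_gaze = p_eye + s * d1
    closest_on_point = p_hand + t * d2
    return (closest_on_gaze + closest_on_point) / 2.0

# Hypothetical example: eye 1.6 m above the origin, hand slightly lower and forward.
focus = input_focus(np.array([0.0, 1.6, 0.0]), np.array([0.3, -0.1, 1.0]),
                    np.array([0.2, 1.2, 0.1]), np.array([0.25, 0.0, 1.0]))
print("estimated user input focus:", focus)
```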
2. Sight direction identification
2.1 Facial image acquisition
According to colorimetry theory, stimulus values of captured image colors in RGB space can be calculated by the following formula:
R = ∫φ(λ)·r(λ)dλ, G = ∫φ(λ)·g(λ)dλ, B = ∫φ(λ)·b(λ)dλ.
where φ(λ) represents the relative spectral power distribution of the color light perceived by the human eye, r(λ), g(λ) and b(λ) are the CIE 1964 spectral tristimulus values, and the integration range is the visible-light band, generally 380 nm–780 nm.
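As a rough numerical illustration of these integrals, the following Python sketch approximates the stimulus values with the trapezoidal rule over the 380 nm–780 nm band; the spectral curves used here are placeholder Gaussians standing in for φ(λ) and the CIE matching functions, not real tabulated data.

```python
import numpy as np

# Wavelength grid over the visible band (380-780 nm).
lam = np.linspace(380.0, 780.0, 81)

# Placeholder curves standing in for phi(lambda) and the CIE 1964 matching
# functions; real tabulated values would be loaded here instead.
phi = np.exp(-((lam - 560.0) / 120.0) ** 2)      # relative spectral power
r_bar = np.exp(-((lam - 600.0) / 45.0) ** 2)
g_bar = np.exp(-((lam - 550.0) / 45.0) ** 2)
b_bar = np.exp(-((lam - 450.0) / 45.0) ** 2)

# Stimulus values as integrals of phi times the matching function.
R = np.trapz(phi * r_bar, lam)
G = np.trapz(phi * g_bar, lam)
B = np.trapz(phi * b_bar, lam)
print(f"R={R:.2f}, G={G:.2f}, B={B:.2f}")
```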
The brightness of the acquired image has a great influence on the accuracy of recognition. Therefore, according to the constancy of human eyes in color perception, the influence of brightness in skin color representation is removed, and the component value of skin color is obtained, wherein the calculation formula of the component value is as follows:
r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B).
where R, G and B are the stimulus values of the image colors, r is the red component, g is the green component, and b is the blue component.
The component values of the skin colors are normalized, so that the RGB space is converted into the rg space, in which the skin colors follow a two-dimensional normal distribution.
The face of the user is distinguished from the background image, and grayscale processing is then carried out to obtain a grayscale image. The feature points of the face are located in the grayscale image and the face parameters are extracted to obtain the face image of the user.
2.1.1 Model selection
Since face orientation recognition is a highly nonlinear problem, a Gaussian RBF neural network GRBF is adopted. As shown in fig. 5, the Gaussian RBF neural network GRBF has an m-h-n structure, where m is the input vector dimension, h is the number of hidden layer units, and n is the output vector dimension. X = (x_1, x_2, …, x_m)^T is the input vector set of the network, composed of the signal source nodes.
The hidden layer adopts a nonlinear optimization strategy to adjust the parameters of the activation function. The number of hidden layer elements h depends on the problem described. The radial basis function selects the gaussian function, namely:
φ_i(x_p) = exp(−‖x_p − c_i‖² / (2σ²)), i = 1, 2, …, h.
where c_i is the center of the network hidden layer node, σ is the variance of the basis function, x_p is the p-th input vector, p = 1, 2, 3, …, P, and P is the total number of input vectors.
Since it is difficult to determine a specific angle of the face orientation, the vertical angle and the horizontal angle are divided into intervals. The face image of the user is divided in the vertical direction and the horizontal direction according to the number of nodes; 25 possible standard codes consisting of 5-bit codes can be represented, giving a standard code set Y = [Y_1, Y_2, Y_3, Y_4, Y_5] that represents the horizontal angle and the vertical angle of the face image. Y_1 is the division result of the horizontal angle and the vertical angle of the face image at the first node, Y_2 is the division result at the second node, Y_3 is the division result at the third node, Y_4 is the division result at the fourth node, and Y_5 is the division result at the fifth node. The standard coding scheme is shown in Table 1:
table 1 face orientation standard coding
Characteristic values are extracted from the facial image and normalized to obtain an input vector set X = (x_1, x_2, …, x_m)^T.
The input vector is input into a Gaussian RBF neural network GRBF to obtain an output vector set y= [ y1, y2, y3, y4, y5], and the formula is as follows:
y_j = Σ_{i=1}^{h} w_ij · exp(−‖x_p − c_i‖² / (2b_j²)), j = 1, 2, …, n.
where j = 1, 2, 3, …, n, n is the output vector dimension, y_j is the output vector of the j-th output node of the Gaussian RBF neural network GRBF corresponding to the input vector, b_j is the basis function width of the network hidden layer node, x_p is the p-th input vector, p = 1, 2, 3, …, P, P is the total number of input vectors, i = 1, 2, 3, …, h, h is the number of hidden layer units, w_ij is the connection weight from the hidden layer to the output layer, c_i is the center of the network hidden layer node, and σ is the variance of the basis function.
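A minimal Python sketch of this forward computation is given below; the network sizes, centers and weights are random placeholders rather than trained values.

```python
import numpy as np

def grbf_forward(x, centers, widths, weights):
    """Forward pass of a Gaussian RBF network for one input vector x.

    centers: (h, m) hidden-node centers c_i
    widths:  (h,)   basis-function widths
    weights: (h, n) hidden-to-output connection weights w_ij
    Returns the n-dimensional output vector y."""
    dist2 = np.sum((centers - x) ** 2, axis=1)      # ||x - c_i||^2
    hidden = np.exp(-dist2 / (2.0 * widths ** 2))   # Gaussian activations
    return hidden @ weights                         # y_j = sum_i w_ij * phi_i

# Hypothetical tiny network: m=3 inputs, h=4 hidden units, n=5 outputs.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))
widths = np.full(4, 1.0)
weights = rng.normal(size=(4, 5))

y = grbf_forward(rng.normal(size=3), centers, widths, weights)
print("output vector y:", np.round(y, 3))
```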
Let d be the expected output value of the input vector, the variance σ of the basis function can be expressed as:
σ = (1/P) · Σ_j ‖d_j − y_j·c_i‖².
where c_i is the center of the network hidden layer node, y_j is the output vector of the j-th output node of the Gaussian RBF neural network GRBF corresponding to the input vector, and σ is the variance of the basis function.
2.1.2 Results
Based on the standard code and the output vector, obtaining a confidence coefficient beta, wherein the confidence coefficient beta is used for representing the matching degree between the output vector and the standard code, and the formula is as follows:
β = 1 − (1/n) · Σ_{j=1}^{n} |y_j − Y_j|.
where Y_j is an element in the standard code set, and n and j are positive integers.
If the confidence coefficient beta is larger than or equal to the set threshold alpha, the division result corresponding to the standard code is taken as the horizontal angle and the vertical angle of the user face image; otherwise, the face orientation of the user face image cannot be identified from that standard code.
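The following Python sketch illustrates this matching step. The matching-degree formula used here (one minus the mean absolute difference) and the standard codes are illustrative assumptions, not necessarily the exact definitions used in the method.

```python
import numpy as np

def confidence(y, code):
    """Matching degree between the network output y and a standard code,
    assumed here to be 1 minus the mean absolute difference."""
    y, code = np.asarray(y, dtype=float), np.asarray(code, dtype=float)
    return 1.0 - np.abs(y - code).mean()

def classify_orientation(y, standard_codes, alpha=0.85):
    """Return the index of the best-matching standard code and its confidence,
    or (None, confidence) if no code reaches the threshold alpha."""
    betas = [confidence(y, c) for c in standard_codes]
    best = int(np.argmax(betas))
    return (best, betas[best]) if betas[best] >= alpha else (None, betas[best])

# Hypothetical 5-bit standard codes for three face orientations.
codes = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
y = [0.05, 0.92, 0.10, 0.02, 0.01]
print(classify_orientation(y, codes))   # matches the second code
```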
2.1.3 Gaussian RBF neural network GRBF
The Gaussian RBF neural network GRBF has three parameters to be solved, namely the basis function center c, the variance σ_i, and the weights from the hidden layer to the output layer.
A. Solving the basis function centers c based on the K-means clustering method
① Network initialization. h training samples are randomly chosen as the cluster centers c_i (i = 1, 2, …, h).
② The input training sample set is grouped by the nearest-neighbor rule: each x_p is assigned to a cluster set δ_p (p = 1, 2, …, P) according to the Euclidean distance between x_p and the centers c_i.
③ Readjusting the cluster centers. The average value of the training samples in each cluster set δ_p is calculated as the new cluster center c_i. If the cluster centers no longer change, the obtained c_i are the final basis function centers c of the GRBF neural network; otherwise, return to ② and enter the next round of center solving.
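A compact Python sketch of this clustering procedure, run on a hypothetical two-dimensional training set, is given below.

```python
import numpy as np

def kmeans_centers(samples, h, iters=100, seed=0):
    """Select h basis-function centers from the training samples by K-means.
    samples: (P, m) array of input vectors x_p."""
    rng = np.random.default_rng(seed)
    # 1) Initialization: randomly choose h training samples as cluster centers.
    centers = samples[rng.choice(len(samples), size=h, replace=False)].copy()
    for _ in range(iters):
        # 2) Nearest-neighbor rule: assign each x_p to its closest center.
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Readjust the centers to the mean of each cluster set.
        new_centers = np.array([
            samples[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(h)
        ])
        if np.allclose(new_centers, centers):        # centers no longer change
            break
        centers = new_centers
    return centers

# Hypothetical training set: 40 two-dimensional feature vectors in two groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
print(kmeans_centers(X, h=2))
```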
B. Solving for the variance σ_i
σ_i = c_max / √(2h), i = 1, 2, …, h.
where c_max is the maximum distance between the selected centers, and i = 1, 2, …, h.
C. Calculating the weights w from the hidden layer to the output layer
The connection weight of the neurons from the hidden layer to the output layer is directly calculated by a least square method, and the calculation formula is as follows:
w = exp((h / c_max²) · ‖x_p − c_i‖²), p = 1, 2, …, P; i = 1, 2, …, h.
where p = 1, 2, …, P, P is the total number of input vectors, and i = 1, 2, …, h.
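The following Python sketch illustrates one way to carry out these two steps: the variance is taken as c_max/√(2h) and the hidden-to-output weights are obtained by an ordinary least-squares fit of the hidden-layer activations to the expected outputs. The training data and centers are hypothetical.

```python
import numpy as np

def train_output_layer(samples, targets, centers):
    """Compute the basis-function variance and the hidden-to-output weights.

    The variance follows sigma = c_max / sqrt(2h); the weights are found by a
    least-squares fit of the hidden-layer activations to the expected outputs."""
    h = len(centers)
    # Maximum distance between the selected centers.
    c_max = max(np.linalg.norm(ci - cj) for ci in centers for cj in centers)
    sigma = c_max / np.sqrt(2.0 * h)
    # Hidden-layer activation matrix Phi (P x h).
    dist2 = np.sum((samples[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    phi = np.exp(-dist2 / (2.0 * sigma ** 2))
    # Least-squares solution for the weights W (h x n).
    weights, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return sigma, weights

# Hypothetical data: 10 three-dimensional inputs and 5-dimensional target codes.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
T = rng.normal(size=(10, 5))
C = X[:4]                     # pretend these four centers came from K-means
sigma, W = train_output_layer(X, T, C)
print("sigma:", round(sigma, 3), "weights shape:", W.shape)
```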
2.2 Finger pointing identification
The pointing procedure of the hand is divided into three pointing stage features: starting, maintaining and ending. A hidden Markov model HMM is built for each of the three pointing stage features; their state-set topologies are shown in FIG. 6. Optimal path selection is carried out on all the hidden Markov models HMM to determine the finger pointing angle of the user.
An observation sequence O = O_1O_2O_3…O_t, with O_t = (ω_pan,t, ω_tilt,t), is constructed based on each pointing stage feature of the hand image, where ω_pan,t is the translational angular velocity of the pointing stage feature corresponding to the time point t, and ω_tilt,t is the vertical angular velocity of the pointing stage feature corresponding to the time point t.
A model θ = (A, B, π) is constructed, where π is the initial state probability vector, A is the state transition probability matrix, and B is the observation probability matrix.
Based on the observation sequence O and the model theta, a hidden Markov model HMM is constructed, and the expression P (O|theta) is as follows:
P(O|θ) = π_{q_1} · b_{q_1}(O_1) · a_{q_1 q_2} · b_{q_2}(O_2) · … · a_{q_{t−1} q_t} · b_{q_t}(O_t).
where O is the observation sequence, O_1, O_2, …, O_t are the values in the observation sequence, Q is the best path, q_1, q_2, …, q_t are the values in the best path, b_{q_r}(O_r) are the values in the observation probability matrix, a_{q_{r−1} q_r} are the values in the state transition probability matrix, and r = 1, 2, 3, …, t.
Then, the best path Q = q_1q_2q_3…q_t is selected using the segmental K-means algorithm (Segmental K-means) based on the Viterbi decoding algorithm. The basic idea is to express P(O|θ) with the parameters (A, B, π) of the model θ, let ∂P(O|θ)/∂θ = 0, and take the solution as the re-estimation formula for each parameter. P(O|θ) is maximized according to the re-estimation formula of each parameter, the optimal path is obtained, and the finger pointing angle of the user is determined through the optimal path.
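A compact Python sketch of the Viterbi decoding that underlies this path selection is given below; the three-state model parameters and the quantized observation indices are illustrative assumptions, and the segmental K-means re-estimation loop around it is omitted.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi decoding for a discrete HMM.

    obs: observation indices o_1..o_t (e.g. quantized angular velocities)
    pi:  (N,)   initial state probabilities
    A:   (N, N) state transition probabilities
    B:   (N, M) observation probabilities
    Returns the best state path and its log-probability."""
    N, T = len(pi), len(obs)
    logA, logB = np.log(A + 1e-12), np.log(B + 1e-12)
    delta = np.log(pi + 1e-12) + logB[:, obs[0]]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[i, j]: from state i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Hypothetical 3-state model (starting / maintaining / ending), 3 observation symbols.
pi = np.array([0.90, 0.05, 0.05])
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
B = np.array([[0.7, 0.2, 0.1],     # starting: mostly fast motion
              [0.1, 0.2, 0.7],     # maintaining: mostly still
              [0.6, 0.3, 0.1]])    # ending: motion again
obs = [0, 0, 2, 2, 2, 2, 0]
print(viterbi(obs, pi, A, B))
```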
3. Experimental results and analysis
The computer used in the experiment has a main frequency of 3.0 GHz and 2 GB of memory; the image acquisition device is an ordinary camera, and the size of the acquired image is 352 × 288. After the user image is processed, it is converted into a mouse control signal to control the computer; the speed of the vision processing part in the experiment reaches 15 frames/s. For screen positioning, the computer displays an image of a 10 × 10 table in full screen, and the experimenter points to each cell in the table in turn to test the screen positioning accuracy.
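As a small illustration of how the user input focus is turned into a cell of the displayed table, the following Python sketch maps a focus point given in screen pixels to a row and column of the 10 × 10 grid; the screen resolution used is an assumed value.

```python
def focus_to_cell(x, y, screen_w=1280, screen_h=1024, rows=10, cols=10):
    """Map a user input focus (pixel coordinates on the screen) to the
    row and column of the full-screen rows x cols test table."""
    col = min(int(x / screen_w * cols), cols - 1)
    row = min(int(y / screen_h * rows), rows - 1)
    return row, col

# Hypothetical focus point near the center-right of the screen.
print(focus_to_cell(900.0, 400.0))   # -> (3, 7)
```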
Table 2 lists test data for the screen positioning accuracy of the system under different lighting conditions when the experimenter is 3 m away from the camera.
When the lighting conditions are good, the invention can position the screen and recognize the gesture of the user well. When the lighting conditions are poor, the grayscale of the image is affected, so the accuracy of eye detection and hand detection decreases, which in turn affects the accuracy of the system.
Table 2 screen positioning test
FIG. 7 shows test data of screen positioning and gesture recognition accuracy of the system for different distances from the experimenter to the camera under good light conditions. Due to the limitation of the resolution of the acquired image, when the distance from the user to the camera increases, the accuracy of screen positioning and gesture recognition of the system is somewhat reduced. However, as can be seen from fig. 7, the present invention has great advantages over conventional gaze tracking and gesture recognition techniques in terms of remote human-machine interaction.
Based on the same inventive concept, the invention also provides a human-computer interaction system, which comprises:
The image acquisition module is used for acquiring action images when a user points to the interactive object;
the image recognition module is used for recognizing the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
The face processing module is used for extracting the characteristic value of the face image in a three-dimensional coordinate system established by taking the control equipment as an origin, calibrating the horizontal angle and the vertical angle of the face image according to the characteristic value of the face image, and determining the sight direction of a user;
The hand processing module is used for extracting, from the hand image, the pointing stage features of starting, maintaining and ending, establishing a hidden Markov model HMM for each pointing stage feature, and carrying out optimal path selection on all the hidden Markov models HMM so as to determine the finger pointing of the user;
And the response module is used for taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
Furthermore, the invention also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the man-machine interaction method when executing the computer program. The specific implementation method may refer to a method embodiment, and will not be described herein.
Further, the present invention also provides a storage medium on which a computer program is stored, for example a memory containing instructions executable by a processor of a computer device to perform the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. The computer program, when executed by a processor, implements the steps in the embodiments of the man-machine interaction method. The specific implementation may refer to the method embodiments and will not be described herein.
The foregoing examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.