
CN119045668B - Man-machine interaction method, system, equipment and medium - Google Patents

Man-machine interaction method, system, equipment and medium

Info

Publication number
CN119045668B
CN119045668B (application CN202411235489.0A)
Authority
CN
China
Prior art keywords
image
user
facial image
node
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411235489.0A
Other languages
Chinese (zh)
Other versions
CN119045668A (en)
Inventor
孟军英
王丽娜
杨争艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shijiazhuang University
Original Assignee
Shijiazhuang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shijiazhuang University filed Critical Shijiazhuang University
Priority to CN202411235489.0A
Publication of CN119045668A
Application granted
Publication of CN119045668B
Legal status: Active

Links

Classifications

    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06V10/85 Markov-related models; Markov random fields
    • G06V40/168 Human faces: feature extraction; face representation
    • G06V40/172 Human faces: classification, e.g. identification
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06T2207/30196 Indexing scheme for image analysis: human being; person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract


The present invention discloses a human-computer interaction method, system, device, and medium, which relate to the field of computer vision recognition technology. The method comprises: obtaining an action image of a user pointing at an interactive object; recognizing the user's face and hand in the action image to obtain a facial image and a hand image; extracting feature values of the facial image within a coordinate system with a control device as the origin, calibrating the horizontal and vertical angles of the facial image based on the feature values to determine the user's line of sight; establishing a hidden Markov model (HMM) of the hand image's pointing phase features to determine the user's finger pointing; using the intersection of the user's line of sight and the finger pointing as the user input focus, and controlling the interactive object to perform a response action based on the user input focus to achieve human-computer interaction. The present invention improves the accuracy and recognition speed of human-computer interaction.

Description

Man-machine interaction method, system, equipment and medium
Technical Field
The invention relates to the technical field of computer vision recognition, in particular to a man-machine interaction method, a man-machine interaction system, man-machine interaction equipment and man-machine interaction media.
Background
Human-computer interaction (HCI) is the study of interaction between humans and computers; it aims to use all available information channels for human-computer communication and to improve the naturalness and efficiency of interaction.
In the prior art, Baluja and Pomerleau proposed a method for estimating a user's gaze point on a computer screen from images of the human eye: eye images are fed into a neural network that infers the on-screen gaze location. Specifically, eye images of the user are captured by a high-definition camera to ensure sufficient image clarity for accurately capturing details such as the pupil and iris. The preprocessed eye image is input to the neural network, which in its inference stage converts the input eye image into a series of feature maps that are then decoded into on-screen gaze-point coordinates.
The drawback of this prior art is that, by relying on eye images relative to the computer screen as the sole criterion for interaction, the user must stay close to the camera and cannot turn the head too far. When the user's head deflects excessively, misjudgment easily occurs, leading to incorrect interaction operations and failure to achieve the expected interaction effect.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a man-machine interaction method, system, device and medium for solving the above technical problems.
The embodiment of the invention provides a man-machine interaction method, which comprises the following steps:
acquiring an action image when a user points to an interactive object;
Identifying the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
extracting characteristic values of the face image in a three-dimensional coordinate system established by taking control equipment as an origin, and calibrating the horizontal angle and the vertical angle of the face image according to the characteristic values of the face image so as to determine the sight direction of a user;
Extracting pointing-stage features of the hand image, including start, hold and end stages, establishing a hidden Markov model HMM for each pointing-stage feature, and performing optimal path selection over all hidden Markov models HMM so as to determine the finger pointing of the user;
and taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
Optionally, the step of identifying the face of the user on the action image specifically includes:
In the RGB space, the stimulus values of the captured image colors are:

R = ∫ φ(λ)·r(λ) dλ,  G = ∫ φ(λ)·g(λ) dλ,  B = ∫ φ(λ)·b(λ) dλ,

Wherein, φ(λ) represents the relative spectral power distribution of the color light perceived by the human eye, and r(λ), g(λ) and b(λ) are CIE 1964 XYZ spectral stimulus values;
according to the constancy of human color perception, the influence of brightness on the skin-color representation is removed to obtain the skin-color component values, calculated as:

r = R/(R+G+B),  g = G/(R+G+B),  b = B/(R+G+B),

Wherein R, G and B are the stimulus values of the image colors, r is the red component, g is the green component, and b is the blue component;
Normalizing the component values of the skin color converts the RGB space into the r-g space, in which the skin color follows a two-dimensional normal distribution;
Distinguishing the face of the user from the background image, and then carrying out gray processing to obtain a gray image;
and positioning the characteristic points of the face in the gray level image and extracting the face parameters to obtain the face image of the user.
Optionally, calibrating the horizontal angle and the vertical angle of the face image of the user according to the feature value of the face image specifically includes:
Dividing the facial image of the user in the vertical and horizontal directions according to the number of nodes to obtain a standard coding set Y = [Y_1, Y_2, Y_3, Y_4, Y_5], where Y_1 is the division result of the horizontal and vertical angles of the facial image at the first node, Y_2 at the second node, Y_3 at the third node, Y_4 at the fourth node, and Y_5 at the fifth node;
Extracting feature values from the facial image and normalizing them to obtain an input vector set X = (x_1, x_2, …, x_m)^T;
the input vectors are fed into the Gaussian RBF neural network GRBF to obtain an output vector set y = [y_1, y_2, y_3, y_4, y_5], with the formula:

y_j = Σ_{i=1}^{h} w_ij · exp( -||x_p - c_i||^2 / (2σ^2) ),  j = 1, 2, …, n,

Wherein n is the output vector dimension, y_j is the output of the j-th output node of the Gaussian RBF neural network GRBF for the input vector, b_j is the basis-function width of the hidden-layer nodes, x_p is the p-th input vector (p = 1, 2, 3, …, P, with P the total number of input vectors), i = 1, 2, 3, …, h with h the number of hidden-layer units, w_ij is the connection weight from the hidden layer to the output layer, c_i is the center of a hidden-layer node, and σ is the variance of the basis function;
the confidence coefficient β characterizes the degree of matching between the output vector and the standard code and is computed by comparing each output component y_j with the corresponding element Y_j of the standard coding set, where n and j are positive integers;
If the confidence coefficient beta is larger than or equal to the set threshold alpha, the division result corresponding to the standard code is the horizontal angle and the vertical angle of the face image of the user.
Optionally, performing optimal path selection over all hidden Markov models (HMM) specifically includes:
Constructing an observation sequence O = O_1 O_2 O_3 … O_t based on each pointing-stage feature of the hand image, where O_t = (ω_pan,t, ω_tilt,t), ω_pan,t is the pan (horizontal) angular velocity of the pointing-stage feature at time t, and ω_tilt,t is the tilt (vertical) angular velocity at time t;
Constructing a model θ = (A, B, π), where π is the initial state probability vector, A is the state transition probability matrix, and B is the observation probability matrix;
Based on the observation sequence O and the model θ, a hidden Markov model HMM is constructed, with the expression P(O|θ):

P(O|θ) = Σ_Q π_{q1} · b_{q1}(O_1) · a_{q1 q2} · b_{q2}(O_2) · … · a_{q(t-1) qt} · b_{qt}(O_t),

Wherein O is the observation sequence with values O_1, O_2, …, O_t; Q = q_1 q_2 … q_t is a state path, among which the best path is sought; the b_{qr}(O_r) are entries of the observation probability matrix; the a_{q(r-1) qr} are entries of the state transition probability matrix; and r = 1, 2, 3, …, t;
P(O|θ) is expressed in terms of the parameters (A, B, π) of the model θ; setting ∂P(O|θ)/∂θ = 0 and solving yields the re-estimation formula for each parameter;
and maximizing P(O|θ) according to the re-estimation formula of each parameter to obtain the optimal path, and determining the finger-pointing angle of the user from the optimal path.
The embodiment of the invention also provides a man-machine interaction system, which comprises:
The image acquisition module is used for acquiring action images when a user points to the interactive object;
the image recognition module is used for recognizing the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
The face processing module is used for extracting the characteristic value of the face image in a three-dimensional coordinate system established by taking the control equipment as an origin, calibrating the horizontal angle and the vertical angle of the face image according to the characteristic value of the face image, and determining the sight direction of a user;
The hand processing module is used for extracting pointing-stage features of the hand image, including start, hold and end stages, establishing a hidden Markov model (HMM) for each pointing-stage feature, and performing optimal path selection over all HMMs so as to determine the finger pointing of the user;
And the response module is used for taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
The embodiment of the invention also provides computer equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the man-machine interaction method when executing the computer program.
The embodiment of the invention also provides a storage medium, on which a computer program is stored, which when being executed by a processor, realizes the steps of the man-machine interaction method.
Compared with the prior art, the man-machine interaction method, system, equipment and medium provided by the embodiment of the invention have the following beneficial effects:
In the prior art, human-computer interaction uses the user's gaze point on the computer screen, estimated from eye images, as the sole criterion for interaction, so the user must stay close to the camera and cannot turn the head too far; excessive head deflection easily causes misjudgment, incorrect interaction operations, and failure to reach the expected interaction effect.
In the present invention, the user's face and hand are recognized in the action image to obtain a facial image and a hand image, the user's line-of-sight angle is determined from the facial image, and the finger-pointing angle is determined from the hand image; the intersection of the extensions of the line of sight and the finger pointing is taken as the user input focus. Because the input focus is determined by combining the line-of-sight direction with the finger pointing rather than by a single criterion, the misjudgment problem of relying on eye images alone in the prior art is avoided, and the accuracy and recognition speed of human-computer interaction are improved.
Drawings
FIG. 1 is a head-hand relationship cylindrical coordinate system of a human-machine interaction method provided in one embodiment;
FIG. 2 is a diagram of a head-hand relationship of pointing actions for a human-machine interaction method provided in one embodiment;
FIG. 3 is a schematic diagram of a method of human-machine interaction according to one embodiment;
FIG. 4 is a flow chart of a method of human-computer interaction provided in one embodiment;
FIG. 5 is a neural network model diagram of a human-machine interaction method provided in one embodiment;
FIG. 6 is a three-state hidden Markov state set topology of a human-machine interaction method provided in one embodiment;
Fig. 7 is a screen positioning effect diagram of a man-machine interaction method according to an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
1. Description of the principles
As shown in fig. 1, the center point of the face obtained from face recognition is taken as the origin of coordinates; the projections of the line-of-sight direction and the finger pointing onto the x-z plane (rotating through 360°), together with the y-axis in the vertical direction, form a cylindrical coordinate system:
Δθ = θ_Head - θ_Hand,
Δφ = φ_Head - φ_Hand,
Where θ_Head and θ_Hand represent the horizontal angles of the line of sight and of the finger pointing, φ_Head and φ_Hand represent the corresponding vertical angles, and Δθ and Δφ represent the differences between the horizontal and vertical angles, respectively.
As shown in fig. 2, it was found that Δθ and Δφ are both substantially constant during the pointing-maintaining phase; that is, the positional and angular relationship between the head and the hand is fixed while a person is pointing at an object. This relationship can therefore be applied in a human-computer interaction system to quickly locate the user's pointing focus, as sketched below.
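For illustration, a minimal Python sketch (not part of the patent; the numeric angles are invented) of how the near-constant offsets could be used: once Δθ and Δφ have been measured in one pointing frame, the hand angles plus the offsets approximate the gaze angles in later frames.

```python
def angular_offsets(theta_head, theta_hand, phi_head, phi_hand):
    """Return (delta_theta, delta_phi) between gaze and finger pointing, in degrees."""
    return theta_head - theta_hand, phi_head - phi_hand

def predict_gaze_from_hand(theta_hand, phi_hand, delta_theta, delta_phi):
    """During the pointing-maintaining phase the offsets are roughly constant,
    so the hand angles plus the calibrated offsets approximate the gaze angles."""
    return theta_hand + delta_theta, phi_hand + delta_phi

# Hypothetical calibration frame: gaze at (30, -10) degrees, finger at (25, -14) degrees.
d_theta, d_phi = angular_offsets(30.0, 25.0, -10.0, -14.0)
print(predict_gaze_from_hand(27.0, -13.0, d_theta, d_phi))  # approximately (32.0, -9.0)
```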
In one embodiment, a human-computer interaction method is provided, which includes:
1. the working principle is as follows, as shown in fig. 3 and 4:
(1) The user is within a suitable distance from the interactive object (screen, robot, other device, etc.), the control device obtains an action image of the user pointing at the interactive object, the action image comprising an image of the complete process of the user pointing at the interactive object. On the basis of preprocessing the images, the face and the hands of the user on the action image are identified, and the face image and the hand image of the user are obtained respectively. And extracting the characteristic value of the face image in a three-dimensional coordinate system established by taking the control equipment as an origin, and calibrating the horizontal angle and the vertical angle of the face image according to the characteristic value of the face image so as to determine the sight direction of the user. Extracting hand images comprises starting, maintaining and ending pointing stage characteristics, establishing hidden Markov model HMM of each pointing stage characteristic, and carrying out optimal path selection on all hidden Markov model HMM so as to determine the finger pointing of a user. And taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
(2) A gaussian RBF neural network (Generalized Radial Basis Function, GRBF) that is more suitable for addressing highly nonlinear problems is used to calibrate the orientation of a person's face. And extracting relevant characteristic values from the processed face gray level images, inputting the characteristic values into a neural network which is learned and trained in advance, and calibrating the horizontal angle and the vertical angle of the face of the person through comparison and result evaluation so as to determine the sight line direction.
(3) According to the characteristics of the finger-pointing action, a hidden Markov model (Hidden Markov Model, HMM) is adopted to determine the pointing angle. During interaction, the pointing-maintaining stage is separated out using the HMM, and the finger pointing is determined.
(4) When the line-of-sight direction and the finger pointing have been determined, the intersection of the user's line of sight and finger pointing is taken as the user input focus, and the interactive object is controlled to execute a response action according to the user input focus, realizing human-computer interaction (a sketch of this focus computation follows below).
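A sketch of one way the "intersection" of the two directions could be computed. Since two estimated rays in 3-D rarely meet exactly, this sketch (an assumption, not a formula prescribed by the patent) returns the midpoint of the shortest segment between the gaze ray and the finger ray in the control-device coordinate system.

```python
import numpy as np

def focus_from_rays(p_eye, d_gaze, p_finger, d_point):
    """Approximate the 'intersection' of the gaze ray and the finger-pointing ray.

    Noisy 3-D rays rarely meet exactly, so this returns the midpoint of the
    shortest segment between the two lines (an assumed approximation)."""
    d1 = d_gaze / np.linalg.norm(d_gaze)
    d2 = d_point / np.linalg.norm(d_point)
    w0 = p_eye - p_finger
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-9:                 # nearly parallel rays: project onto the finger ray
        s, t = 0.0, e / c
    else:
        s = (b * e - c * d) / denom
        t = (a * e - b * d) / denom
    return (p_eye + s * d1 + p_finger + t * d2) / 2.0

# Hypothetical eye/fingertip positions and directions in the control-device frame (metres).
print(focus_from_rays(np.array([0.0, 1.6, 2.0]), np.array([0.10, -0.20, -1.0]),
                      np.array([0.2, 1.2, 1.9]), np.array([0.05, -0.10, -1.0])))
```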
2. Sight direction identification
2.1 Facial image acquisition
According to colorimetry theory, the stimulus values of the captured image colors in RGB space can be calculated by the following formula:

R = ∫ φ(λ)·r(λ) dλ,  G = ∫ φ(λ)·g(λ) dλ,  B = ∫ φ(λ)·b(λ) dλ,

Wherein φ(λ) represents the relative spectral power distribution of the color light perceived by the human eye, r(λ), g(λ), b(λ) are CIE 1964 XYZ spectral stimulus values, and the integration range is the visible-light band, generally 380 nm-780 nm.
The brightness of the acquired image has a great influence on the accuracy of recognition. Therefore, according to the constancy of human color perception, the influence of brightness on the skin-color representation is removed to obtain the skin-color component values:

r = R/(R+G+B),  g = G/(R+G+B),  b = B/(R+G+B),

wherein R, G and B are the stimulus values of the image colors, r is the red component, g is the green component, and b is the blue component.
After the skin-color component values are normalized, the RGB space is converted into the r-g space, in which the skin color follows a two-dimensional normal distribution.
The user's face is separated from the background image and then converted to grayscale to obtain a grayscale image. Facial feature points are located in the grayscale image and facial parameters are extracted to obtain the user's facial image.
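A rough Python sketch of the skin-color segmentation and grayscale step described above. The r-g threshold ranges are illustrative placeholders; in the patent's formulation they would follow from the two-dimensional normal skin-color model rather than fixed bounds.

```python
import numpy as np

def skin_mask_rg(image_rgb, r_range=(0.40, 0.60), g_range=(0.22, 0.38)):
    """Rough skin segmentation in normalized r-g chromaticity space.

    image_rgb: H x W x 3 uint8 array (assumed RGB channel order).
    The r/g ranges are illustrative only."""
    rgb = image_rgb.astype(np.float64)
    s = rgb.sum(axis=2) + 1e-6          # R + G + B, avoid division by zero
    r = rgb[..., 0] / s                 # red chromaticity (luminance removed)
    g = rgb[..., 1] / s                 # green chromaticity
    return ((r >= r_range[0]) & (r <= r_range[1]) &
            (g >= g_range[0]) & (g <= g_range[1]))

def to_gray(image_rgb, mask):
    """Grayscale conversion of the masked face region (ITU-R BT.601 weights)."""
    gray = image_rgb.astype(np.float64) @ np.array([0.299, 0.587, 0.114])
    return np.where(mask, gray, 0.0)
```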
2.1.1 Model selection
Since facial orientation recognition is a highly nonlinear problem, a Gaussian RBF neural network GRBF is adopted. As shown in fig. 5, the GRBF has an m-h-n structure, where m is the input vector dimension, h is the number of hidden-layer units, and n is the output vector dimension. X = (x_1, x_2, …, x_m)^T is the input vector set of the network, composed of signal source nodes.
The hidden layer adopts a nonlinear optimization strategy to adjust the parameters of the activation function; the number of hidden-layer units h depends on the problem being described. The radial basis function is chosen as the Gaussian function, namely:

φ_i(x_p) = exp( -||x_p - c_i||^2 / (2σ^2) ),

Wherein c_i is the center of the i-th hidden-layer node, σ is the variance of the basis function, x_p is the p-th input vector, p = 1, 2, 3, …, P, and P is the total number of input vectors.
Since it is difficult to recover a specific angle of the face direction, the vertical and horizontal angles are discretized. The facial image of the user is divided in the vertical and horizontal directions according to the number of nodes, which can represent 25 possible standard codes, each consisting of a 5-bit code, yielding the standard coding set Y = [Y1, Y2, Y3, Y4, Y5] that represents the horizontal and vertical angles of the facial image. Y1 is the division result of the horizontal and vertical angles of the facial image at the first node, Y2 at the second node, Y3 at the third node, Y4 at the fourth node, and Y5 at the fifth node. The standard coding scheme is shown in Table 1:
table 1 face orientation standard coding
Feature values are extracted from the facial image and normalized to obtain the input vector set X = (x_1, x_2, …, x_m)^T.
The input vectors are fed into the Gaussian RBF neural network GRBF to obtain the output vector set y = [y1, y2, y3, y4, y5], with the formula:

y_j = Σ_{i=1}^{h} w_ij · exp( -||x_p - c_i||^2 / (2σ^2) ),  j = 1, 2, …, n,

Wherein n is the output vector dimension, y_j is the output of the j-th output node of the GRBF for the input vector, b_j is the basis-function width of the hidden-layer nodes, x_p is the p-th input vector (p = 1, 2, 3, …, P, with P the total number of input vectors), i = 1, 2, 3, …, h with h the number of hidden-layer units, w_ij is the connection weight from the hidden layer to the output layer, c_i is the center of a hidden-layer node, and σ is the variance of the basis function.
Let d be the expected output value for the input vector; the variance σ of the basis function is then expressed in terms of the deviation between d and the outputs y_j of the GRBF output nodes, with c_i the centers of the hidden-layer nodes.
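A compact sketch of the GRBF forward pass defined by the formula above; the array shapes and the random test network are assumptions for illustration only.

```python
import numpy as np

def grbf_forward(x, centers, sigmas, weights):
    """Forward pass of a Gaussian RBF network (m-h-n structure).

    x        : (m,)   input feature vector
    centers  : (h, m) hidden-node centres c_i
    sigmas   : (h,)   basis-function widths
    weights  : (h, n) hidden-to-output weights w_ij
    Returns the n-dimensional output vector y (here n = 5 orientation codes)."""
    dist2 = np.sum((centers - x) ** 2, axis=1)          # ||x - c_i||^2
    phi = np.exp(-dist2 / (2.0 * sigmas ** 2))          # Gaussian activations
    return phi @ weights                                # y_j = sum_i w_ij * phi_i

# Tiny illustrative network: 4-D features, 6 hidden units, 5 outputs.
rng = np.random.default_rng(0)
y = grbf_forward(rng.random(4), rng.random((6, 4)), np.full(6, 0.5), rng.random((6, 5)))
print(y.shape)  # (5,)
```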
2.1.2 Results
Based on the standard code and the output vector, a confidence coefficient β is obtained; β characterizes the degree of matching between the output vector and the standard code and is computed by comparing each output component y_j with the corresponding element Y_j of the standard coding set, where n and j are positive integers.
If the confidence coefficient β is greater than or equal to the set threshold α, the division result corresponding to the standard code gives the horizontal and vertical angles of the user's facial image; otherwise, the face orientation cannot be identified from that facial image.
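An illustrative matching routine in the spirit of the confidence test above. Because the exact β formula is not reproduced in this text, the sketch substitutes a simple mean-absolute-deviation score and labels it as such; the codes and threshold are invented for the example.

```python
import numpy as np

def match_orientation(y, standard_codes, alpha=0.8):
    """Pick the face-orientation code whose standard encoding best matches y.

    standard_codes: dict mapping an orientation label to its 5-bit code.
    The confidence used here (1 minus mean absolute deviation) is an assumed
    stand-in for the beta formula, which is not reproduced in the source text."""
    best_label, best_beta = None, -np.inf
    for label, code in standard_codes.items():
        beta = 1.0 - np.mean(np.abs(np.asarray(y) - np.asarray(code)))
        if beta > best_beta:
            best_label, best_beta = label, beta
    return (best_label, best_beta) if best_beta >= alpha else (None, best_beta)

codes = {"front": [0, 0, 1, 0, 0], "left": [1, 0, 0, 0, 0]}   # illustrative codes
print(match_orientation([0.1, 0.0, 0.9, 0.05, 0.0], codes))
```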
2.1.3 Gaussian RBF neural network GRBF
Three parameter groups of the Gaussian RBF neural network GRBF must be solved: the basis-function centers c, the variances σ_i, and the weights w from the hidden layer to the output layer.
A. Obtaining the basis-function centers c by the K-means clustering method
① Initialize the network: randomly choose h training samples as the cluster centers c_i (i = 1, 2, …, h).
② Group the input training samples by the nearest-neighbor rule: assign x_p to the cluster set δ_p (p = 1, 2, …, P) according to the Euclidean distance between x_p and the centers c_i.
③ Readjust the cluster centers: compute the mean of the training samples in each cluster set δ_p as the new cluster center c_i. If the cluster centers no longer change, the resulting c_i are the final basis-function centers c of the GRBF network; otherwise return to ② for the next round of center solving.
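A minimal K-means routine corresponding to steps ①-③; the iteration cap and random seed are assumptions.

```python
import numpy as np

def kmeans_centers(X, h, iters=100, seed=0):
    """K-means clustering of the training inputs to pick the h basis-function centres."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=h, replace=False)]    # step 1: random init
    for _ in range(iters):
        # step 2: assign each sample to the nearest centre (Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # step 3: recompute each centre as the mean of its cluster
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(h)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```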
B. Solving for the variances σ_i

σ_i = c_max / √(2h),  i = 1, 2, …, h,

where c_max is the maximum distance between the selected centers and h is the number of hidden-layer units.
C. Calculating the weights w between the hidden layer and the output layer
The connection weights of the neurons from the hidden layer to the output layer are calculated directly by the least-squares method, where p = 1, 2, …, P (P is the total number of input vectors) and i = 1, 2, …, h.
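A sketch of steps B and C together, assuming the common width heuristic σ = c_max/√(2h) and an ordinary least-squares solve for the output weights; the exact closed-form expression used in the original filing is not reproduced in this text.

```python
import numpy as np

def train_grbf_output(X, D, centers):
    """Fix the widths from the centre spread and solve the output weights by least squares.

    X: (P, m) training inputs, D: (P, n) desired outputs, centers: (h, m)."""
    h = len(centers)
    c_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    sigma = c_max / np.sqrt(2.0 * h)                      # common width heuristic
    dist2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
    Phi = np.exp(-dist2 / (2.0 * sigma ** 2))             # (P, h) design matrix
    W, *_ = np.linalg.lstsq(Phi, D, rcond=None)           # least-squares weights (h, n)
    return sigma, W
```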
2.2 Finger pointing identification
The pointing process of the hand is divided into three pointing-stage features: start, hold, and end. A hidden Markov model HMM is built for each of the three pointing-stage features; their state-set topologies are shown in fig. 6. Optimal path selection is then performed over all HMMs to determine the user's finger-pointing angle.
An observation sequence O = O_1 O_2 O_3 … O_t is constructed from each pointing-stage feature of the hand image, where O_t = (ω_pan,t, ω_tilt,t), ω_pan,t is the pan (horizontal) angular velocity of the pointing-stage feature at time t, and ω_tilt,t is the tilt (vertical) angular velocity at time t.
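A sketch of how the observations O_t = (ω_pan,t, ω_tilt,t) could be formed from per-frame pointing angles; the finite-difference discretisation and the 15 fps frame period are assumptions.

```python
import numpy as np

def observation_sequence(pan_angles, tilt_angles, dt):
    """Build the HMM observation sequence O_t = (omega_pan_t, omega_tilt_t).

    pan_angles / tilt_angles: per-frame pointing angles (degrees); dt: frame period (s).
    The angular velocities are simple finite differences (an assumed discretisation)."""
    omega_pan = np.diff(pan_angles) / dt
    omega_tilt = np.diff(tilt_angles) / dt
    return np.stack([omega_pan, omega_tilt], axis=1)      # shape (t-1, 2)

obs = observation_sequence([10, 14, 20, 21, 21], [5, 6, 8, 8, 8], dt=1 / 15)
print(obs)
```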
A model θ = (A, B, π) is constructed, where π is the initial state probability vector, A is the state transition probability matrix, and B is the observation probability matrix.
Based on the observation sequence O and the model θ, a hidden Markov model HMM is constructed, with the expression P(O|θ):

P(O|θ) = Σ_Q π_{q1} · b_{q1}(O_1) · a_{q1 q2} · b_{q2}(O_2) · … · a_{q(t-1) qt} · b_{qt}(O_t),

where O is the observation sequence with values O_1, O_2, …, O_t; Q = q_1 q_2 … q_t is a state path, among which the best path is sought; the b_{qr}(O_r) are entries of the observation probability matrix; the a_{q(r-1) qr} are entries of the state transition probability matrix; and r = 1, 2, 3, …, t.
Then, the best path Q = q_1 q_2 q_3 … q_t is selected with the segmental K-means algorithm based on Viterbi decoding. The basic idea is to express P(O|θ) in terms of the parameters (A, B, π) of the model θ, set ∂P(O|θ)/∂θ = 0, and take the solution as the re-estimation formula for each parameter. Maximizing P(O|θ) according to these re-estimation formulas yields the optimal path, from which the user's finger-pointing angle is determined.
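A log-space Viterbi decoder as a sketch of the best-path selection step; the per-frame observation log-likelihoods (for example from Gaussians fitted to the angular velocities) and the three-state topology of fig. 6 are assumed inputs, not reproduced from the patent.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Best state path for a discrete-time HMM (log-space Viterbi).

    log_pi: (S,)   initial log-probabilities
    log_A : (S, S) transition log-probabilities
    log_B : (T, S) per-frame observation log-likelihoods
    Returns the most likely state sequence of length T."""
    T, S = log_B.shape
    delta = np.empty((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A            # (S, S): previous -> current
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                        # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path
```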
3. Experimental results and analysis
The computer used in the experiments has a 3.0 GHz CPU and 2 GB of memory; the image acquisition device is an ordinary camera, and the captured image size is 352 × 288. After the user image is processed, it is converted into a mouse control signal to control the computer; the visual processing part reaches 15 frames/s in the experiments. For screen positioning, the computer displays a 10 × 10 table full screen, and the experimenter points at each cell in the table in turn to test the positioning accuracy of the screen.
Table 2 lists the screen-positioning accuracy of the system under different lighting conditions when the experimenter is 3 m from the camera.
When the lighting is good, the system positions the screen and recognizes the user's gestures well. When the lighting is poor, the gray levels of the image are affected, reducing the accuracy of eye and hand detection and hence the accuracy of the system.
Table 2 screen positioning test
FIG. 7 shows test data of screen positioning and gesture recognition accuracy of the system for different distances from the experimenter to the camera under good light conditions. Due to the limitation of the resolution of the acquired image, when the distance from the user to the camera increases, the accuracy of screen positioning and gesture recognition of the system is somewhat reduced. However, as can be seen from fig. 7, the present invention has great advantages over conventional gaze tracking and gesture recognition techniques in terms of remote human-machine interaction.
Based on the same inventive concept, the invention also provides a human-computer interaction system, which comprises:
The image acquisition module is used for acquiring action images when a user points to the interactive object;
the image recognition module is used for recognizing the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
The face processing module is used for extracting the characteristic value of the face image in a three-dimensional coordinate system established by taking the control equipment as an origin, calibrating the horizontal angle and the vertical angle of the face image according to the characteristic value of the face image, and determining the sight direction of a user;
The hand processing module is used for extracting pointing-stage features of the hand image, including start, hold and end stages, establishing a hidden Markov model (HMM) for each pointing-stage feature, and performing optimal path selection over all HMMs so as to determine the finger pointing of the user;
And the response module is used for taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
Furthermore, the invention also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the man-machine interaction method when executing the computer program. The specific implementation method may refer to a method embodiment, and will not be described herein.
Further, the present invention also provides a storage medium having a computer program stored thereon, for example a memory containing instructions executable by a processor of a computer device to perform the above method. The non-transitory computer-readable storage medium may be, for example, a ROM, random-access memory (RAM), CD-ROM, magnetic tape, floppy disk, or optical data storage device. The computer program, when executed by a processor, implements the steps in the embodiments of the human-computer interaction method. For specific implementation details, reference may be made to the method embodiments, which are not repeated here.
The foregoing examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (6)

1. A human-computer interaction method, characterized by comprising: acquiring an action image of a user pointing at an interactive object; recognizing the user's face and hand in the action image to obtain a facial image and a hand image of the user, respectively; extracting feature values of the facial image within a three-dimensional coordinate system established with the control device as the origin, and calibrating the horizontal angle and the vertical angle of the facial image according to the feature values of the facial image so as to determine the user's line-of-sight direction; extracting pointing-stage features of the hand image, including start, hold and end stages, establishing a hidden Markov model HMM for each pointing-stage feature, and performing optimal path selection over all hidden Markov models HMM so as to determine the user's finger pointing; and taking the intersection of the user's line-of-sight direction and finger pointing as the user input focus, and controlling the interactive object to perform a response action according to the user input focus, so as to realize human-computer interaction; wherein calibrating the horizontal angle and the vertical angle of the facial image according to the feature values of the facial image specifically comprises: dividing the facial image in the vertical and horizontal directions according to the number of nodes to obtain a standard coding set Y = [Y1, Y2, Y3, Y4, Y5] representing the horizontal and vertical angles of the facial image, where Y1 is the division result of the horizontal and vertical angles of the facial image at the first node, Y2 at the second node, Y3 at the third node, Y4 at the fourth node, and Y5 at the fifth node; extracting feature values from the facial image and normalizing them to obtain an input vector set X = (x1, x2, …, xm)T; inputting the input vectors into a Gaussian RBF neural network GRBF to obtain an output vector set y = [y1, y2, y3, y4, y5], where yj is the output vector of the j-th output node of the GRBF corresponding to the input vector, j = 1, 2, 3, …, n, and n is the output vector dimension; determining a confidence coefficient β based on the standard code and the output vector, the confidence coefficient β characterizing the degree of matching between the output vector and the standard code; and if the confidence coefficient β is greater than or equal to a set threshold α, taking the division result corresponding to the standard code as the horizontal angle and the vertical angle of the facial image.
2. The human-computer interaction method according to claim 1, characterized in that recognizing the user's face in the action image specifically comprises: in RGB space, obtaining the stimulus values of the captured image colors, where φ(λ) represents the relative spectral power distribution of the color light perceived by the human eye and r(λ), g(λ) and b(λ) are CIE 1964 XYZ spectral stimulus values; removing, according to the constancy of human color perception, the influence of brightness on the skin-color representation to obtain the skin-color component values, where R, G and B are the stimulus values of the image colors, r is the red component, g is the green component, and b is the blue component; normalizing the skin-color component values to convert the RGB space into the r-g space, in which the skin color follows a two-dimensional normal distribution; separating the user's face from the background image and performing grayscale processing to obtain a grayscale image; and locating facial feature points and extracting facial parameters in the grayscale image to obtain the user's facial image.
3. The human-computer interaction method according to claim 1, characterized in that inputting the input vectors into the Gaussian RBF neural network GRBF to obtain the output vector set y = [y1, y2, y3, y4, y5] uses a formula in which bj is the basis-function width of the network hidden-layer nodes, xp is the p-th input vector, p = 1, 2, 3, …, P, P is the total number of input vectors, i = 1, 2, 3, …, h, h is the number of hidden-layer units, wij is the connection weight from the hidden layer to the output layer, ci is the center of a hidden-layer node, and σ is the variance of the basis function; and the confidence coefficient β determined from the standard code and the output vector, which characterizes the degree of matching between the output vector and the standard code, is given by a formula in which Yj is an element of the standard coding set and n and j are positive integers.
4. A human-computer interaction system, characterized by comprising: an image acquisition module configured to acquire an action image of a user pointing at an interactive object; an image recognition module configured to recognize the user's face and hand in the action image to obtain a facial image and a hand image of the user, respectively; a facial processing module configured to extract feature values of the facial image within a three-dimensional coordinate system established with the control device as the origin, and to calibrate the horizontal angle and the vertical angle of the facial image according to the feature values of the facial image so as to determine the user's line-of-sight direction; wherein calibrating the horizontal angle and the vertical angle of the facial image according to the feature values of the facial image specifically comprises: dividing the facial image in the vertical and horizontal directions according to the number of nodes to obtain a standard coding set Y = [Y1, Y2, Y3, Y4, Y5] representing the horizontal and vertical angles of the facial image, where Y1 is the division result of the horizontal and vertical angles of the facial image at the first node, Y2 at the second node, Y3 at the third node, Y4 at the fourth node, and Y5 at the fifth node; extracting feature values from the facial image and normalizing them to obtain an input vector set X = (x1, x2, …, xm)T; inputting the input vectors into a Gaussian RBF neural network GRBF to obtain an output vector set y = [y1, y2, y3, y4, y5], where yj is the output vector of the j-th output node of the GRBF corresponding to the input vector, j = 1, 2, 3, …, n, and n is the output vector dimension; determining a confidence coefficient β based on the standard code and the output vector, the confidence coefficient β characterizing the degree of matching between the output vector and the standard code; and if the confidence coefficient β is greater than or equal to a set threshold α, taking the division result corresponding to the standard code as the horizontal angle and the vertical angle of the facial image; a hand processing module configured to extract pointing-stage features of the hand image, including start, hold and end stages, establish a hidden Markov model HMM for each pointing-stage feature, and perform optimal path selection over all hidden Markov models HMM so as to determine the user's finger pointing; and a response module configured to take the intersection of the user's line-of-sight direction and finger pointing as the user input focus, and to control the interactive object to perform a response action according to the user input focus, so as to realize human-computer interaction.
5. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the human-computer interaction method according to any one of claims 1-3.
6. A storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the human-computer interaction method according to any one of claims 1-3.
CN202411235489.0A 2024-09-04 2024-09-04 Man-machine interaction method, system, equipment and medium Active CN119045668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411235489.0A CN119045668B (en) 2024-09-04 2024-09-04 Man-machine interaction method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411235489.0A CN119045668B (en) 2024-09-04 2024-09-04 Man-machine interaction method, system, equipment and medium

Publications (2)

Publication Number Publication Date
CN119045668A CN119045668A (en) 2024-11-29
CN119045668B 2025-09-19

Family

ID=93581557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411235489.0A Active CN119045668B (en) 2024-09-04 2024-09-04 Man-machine interaction method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN119045668B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845668A (en) * 2012-11-07 2018-11-20 北京三星通信技术研究有限公司 Man-machine interactive system and method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101299234B (en) * 2008-06-06 2011-05-11 华南理工大学 Method for recognizing human eye state based on built-in type hidden Markov model
KR101017936B1 (en) * 2008-09-18 2011-03-04 동명대학교산학협력단 System that controls the operation of the display device based on the user's gesture information recognition
CN102270035A (en) * 2010-06-04 2011-12-07 三星电子株式会社 Apparatus and method for selecting and operating object in non-touch mode
KR101302638B1 (en) * 2011-07-08 2013-09-05 더디엔에이 주식회사 Method, terminal, and computer readable recording medium for controlling content by detecting gesture of head and gesture of hand
CN108256421A (en) * 2017-12-05 2018-07-06 盈盛资讯科技有限公司 Dynamic gesture sequence real-time identification method, system and device
CN111931579B (en) * 2020-07-09 2023-10-31 上海交通大学 Automatic driving assistance system and method using eye tracking and gesture recognition techniques
CN112257696B (en) * 2020-12-23 2021-05-28 北京万里红科技股份有限公司 Sight estimation method and computing equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845668A (en) * 2012-11-07 2018-11-20 北京三星通信技术研究有限公司 Man-machine interactive system and method

Also Published As

Publication number Publication date
CN119045668A (en) 2024-11-29

Similar Documents

Publication Publication Date Title
CN108460356B (en) An automatic face image processing system based on monitoring system
Lu et al. Inferring human gaze from appearance via adaptive linear regression
US10445602B2 (en) Apparatus and method for recognizing traffic signs
WO2020125499A9 (en) Operation prompting method and glasses
EP3647992A1 (en) Face image processing method and apparatus, storage medium, and electronic device
Li et al. Efficient 3D face recognition handling facial expression and hair occlusion
CN111783748A (en) Face recognition method and device, electronic equipment and storage medium
CN111046734A (en) Line-of-sight estimation method for multimodal fusion based on dilated convolution
Wang et al. Head pose estimation with combined 2D SIFT and 3D HOG features
WO2021258588A1 (en) Face image recognition method, apparatus and device and storage medium
Premaratne et al. Centroid tracking based dynamic hand gesture recognition using discrete hidden Markov models
CN113591763B (en) Classification recognition method and device for face shapes, storage medium and computer equipment
CN111158457A (en) A vehicle HUD human-computer interaction system based on gesture recognition
Lai et al. Appearance-based gaze tracking with free head movement
CN112446322A (en) Eyeball feature detection method, device, equipment and computer-readable storage medium
CN112906520A (en) Gesture coding-based action recognition method and device
Aziz et al. Bengali Sign Language Recognition using dynamic skin calibration and geometric hashing
CN116665282A (en) Face recognition model training method, face recognition method and device
CN115862055B (en) Pedestrian re-recognition method and device based on contrast learning and countermeasure training
CN113591797B (en) Depth video behavior recognition method
CN119045668B (en) Man-machine interaction method, system, equipment and medium
Manaf et al. Color recognition system with augmented reality concept and finger interaction: Case study for color blind aid system
CN114973389A (en) Eye movement tracking method based on coupling cascade regression
Su et al. Smart living: an interactive control system for household appliances
Bhiri et al. 2MLMD: Multi-modal leap motion dataset for home automation hand gesture recognition systems

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant