Disclosure of Invention
Based on the foregoing, it is necessary to provide a man-machine interaction method, system, device and medium for solving the above technical problems.
The embodiment of the invention provides a man-machine interaction method, which comprises the following steps:
acquiring an action image when a user points to an interactive object;
identifying the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
extracting characteristic values of the face image in a three-dimensional coordinate system established by taking control equipment as an origin, and calibrating the horizontal angle and the vertical angle of the face image according to the characteristic values of the face image so as to determine the sight direction of a user;
extracting, from the hand image, the pointing stage features of starting, maintaining and ending, establishing a hidden Markov model HMM for each pointing stage feature, and carrying out optimal path selection on all the hidden Markov models HMM so as to determine the finger pointing of the user;
and taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
Optionally, the step of identifying the face of the user on the action image specifically includes:
In the RGB space, stimulus values of captured image colors are:
R = ∫φ(λ)·r(λ)dλ, G = ∫φ(λ)·g(λ)dλ, B = ∫φ(λ)·b(λ)dλ,
wherein φ(λ) represents the relative spectral power distribution of the color light perceived by the human eye, and r(λ), g(λ) and b(λ) are the CIE 1964 spectral tristimulus values;
according to the constancy of human eyes in color perception, the influence of brightness in skin color representation is removed, and the component value of skin color is obtained, wherein the calculation formula of the component value is as follows:
r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B),
wherein R, G and B are the stimulus values of the image colors, r is the red component, g is the green component, and b is the blue component;
normalizing the component values of the skin colors, converting the RGB space into an rg space, so that the skin colors follow a two-dimensional normal distribution in the rg space;
distinguishing the face of the user from the background image, and then carrying out grayscale processing to obtain a grayscale image;
and locating the feature points of the face in the grayscale image and extracting the face parameters to obtain the face image of the user.
Optionally, calibrating the horizontal angle and the vertical angle of the face image of the user according to the feature value of the face image specifically includes:
Dividing the facial image of the user in the vertical direction and the horizontal direction according to the number of nodes to obtain a standard coding set Y = [Y_1, Y_2, Y_3, Y_4, Y_5], wherein Y_1 represents the division result of the horizontal angle and the vertical angle of the facial image at a first node, Y_2 represents the division result of the horizontal angle and the vertical angle of the facial image at a second node, Y_3 represents the division result of the horizontal angle and the vertical angle of the facial image at a third node, Y_4 represents the division result of the horizontal angle and the vertical angle of the facial image at a fourth node, and Y_5 represents the division result of the horizontal angle and the vertical angle of the facial image at a fifth node;
Extracting characteristic values from the facial image, and carrying out normalization processing on the characteristic values to obtain an input vector set X = (x_1, x_2, …, x_m)^T;
the input vector is input into a Gaussian RBF neural network GRBF to obtain an output vector set y = [y_1, y_2, y_3, y_4, y_5], with the formula:
y_j = Σ_{i=1}^{h} w_ij · exp(−‖x_p − c_i‖² / (2b_j²)), j = 1, 2, …, n,
wherein j = 1, 2, 3, …, n, n is the output vector dimension, y_j is the output vector of the j-th output node of the Gaussian RBF neural network GRBF corresponding to the input vector, b_j is the basis function width of the network hidden layer node, x_p is the p-th input vector, p = 1, 2, 3, …, P, P is the total number of input vectors, i = 1, 2, 3, …, h, h is the number of hidden layer units, w_ij is the connection weight from the hidden layer to the output layer, c_i is the center of the network hidden layer node, and σ is the variance of the basis function;
the confidence coefficient beta is used for representing the matching degree between the output vector and the standard code, and the formula is as follows:
β = 1 − (1/n) · Σ_{j=1}^{n} |y_j − Y_j|,
wherein Y_j is an element in the standard coding set, and n and j are positive integers;
If the confidence coefficient beta is larger than or equal to the set threshold alpha, the division result corresponding to the standard code is the horizontal angle and the vertical angle of the face image of the user.
Optionally, performing optimal path selection on all the hidden Markov models HMM specifically includes:
Constructing an observation sequence O = O_1O_2O_3…O_t, with O_t = (ω_pan,t, ω_tilt,t), based on each pointing stage feature of the hand image, wherein ω_pan,t is the translational angular velocity of the pointing stage feature corresponding to the time point t, and ω_tilt,t is the vertical angular velocity of the pointing stage feature corresponding to the time point t;
Constructing a model θ = (A, B, π), wherein π is the initial state probability vector, A is the state transition probability matrix, and B is the observation probability matrix;
Based on the observation sequence O and the model theta, a hidden Markov model HMM is constructed, and the expression P (O|theta) is as follows:
P(O|θ) = π_{q_1} · b_{q_1}(O_1) · a_{q_1 q_2} · b_{q_2}(O_2) · … · a_{q_{t−1} q_t} · b_{q_t}(O_t),
wherein O is the observation sequence, O_1, O_2, …, O_t are the values in the observation sequence, Q is the best path, q_1, q_2, …, q_t are the values in the best path, b_{q_r}(O_r) are the values in the observation probability matrix, a_{q_{r−1} q_r} are the values in the state transition probability matrix, and r = 1, 2, 3, …, t;
expressing P(O|θ) by the parameters (A, B, π) of the model θ, letting ∂P(O|θ)/∂θ = 0, and taking the solution as the re-estimation formula for each parameter;
and maximizing P (O|theta) according to a re-estimation formula of each parameter, obtaining an optimal path, and determining the finger pointing angle of the user through the optimal path.
The embodiment of the invention also provides a man-machine interaction system, which comprises:
The image acquisition module is used for acquiring action images when a user points to the interactive object;
the image recognition module is used for recognizing the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
The face processing module is used for extracting the characteristic value of the face image in a three-dimensional coordinate system established by taking the control equipment as an origin, calibrating the horizontal angle and the vertical angle of the face image according to the characteristic value of the face image, and determining the sight direction of a user;
The hand processing module is used for extracting, from the hand image, the pointing stage features of starting, maintaining and ending, establishing a hidden Markov model HMM for each pointing stage feature, and carrying out optimal path selection on all the hidden Markov models HMM so as to determine the finger pointing of the user;
And the response module is used for taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
The embodiment of the invention also provides computer equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the man-machine interaction method when executing the computer program.
The embodiment of the invention also provides a storage medium, on which a computer program is stored, which when being executed by a processor, realizes the steps of the man-machine interaction method.
Compared with the prior art, the man-machine interaction method, system, equipment and medium provided by the embodiment of the invention have the following beneficial effects:
In the prior art, human-computer interaction relies on eye images to determine the point on the computer screen that the user is gazing at, and uses this point alone as the means of judging the interaction operation. This requires the user to stay close to the camera and limits how far the head may deflect; when the user's head deflects too much, misjudgment easily occurs, resulting in wrong interaction operations and failure to achieve the expected interaction effect.
In the embodiments of the invention, the face and the hand of the user are identified in the action image to obtain a face image and a hand image of the user respectively, the sight angle of the user is determined based on the face image, the pointing angle of the finger of the user is determined based on the hand image, and the intersection point of the extension lines of the sight direction and the finger pointing is taken as the user input focus. Determining the user input focus by combining the sight angle and the finger pointing angle avoids a single judging means, solves the problem in the prior art that misjudgment easily occurs when eye images alone serve as the judging means, and further improves the accuracy and recognition speed of man-machine interaction.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
1. Description of the principles
As shown in fig. 1, the center point of the face after face recognition is taken as the origin of coordinates; the x-axis and the z-axis, over which the projections of the line-of-sight direction and the finger pointing rotate through 360°, together with the y-axis in the vertical direction form a cylindrical coordinate system.
Δθ = θ_Head − θ_Hand,
Δφ = φ_Head − φ_Hand.
where θ_Head and θ_Hand represent the horizontal angles of the line of sight and the finger pointing respectively, φ_Head and φ_Hand represent the vertical angles of the line of sight and the finger pointing respectively, and Δθ and Δφ represent the differences between the horizontal angles and between the vertical angles.
As shown in fig. 2, it was found that both were substantially fixed values during the pointing-maintaining phase, that is, the position and angular relationship of the head and hand were fixed while the human body was pointing at an object. Therefore, the relationship between the two can be applied to a human-computer interaction system to quickly position the pointing focus of the user.
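As a minimal illustration of this relationship, the following Python sketch computes Δθ and Δφ for a short sequence of frames; the per-frame head and hand angles are hypothetical values, assumed to be already estimated in degrees.

```python
import numpy as np

def angle_offsets(theta_head, theta_hand, phi_head, phi_hand):
    """Compute the horizontal and vertical angle differences between
    the line of sight (head) and the finger pointing (hand)."""
    d_theta = np.asarray(theta_head) - np.asarray(theta_hand)
    d_phi = np.asarray(phi_head) - np.asarray(phi_hand)
    return d_theta, d_phi

# Hypothetical per-frame angles (degrees) during the pointing-maintaining phase.
theta_head = [31.0, 30.8, 31.2, 30.9]
theta_hand = [12.1, 11.9, 12.3, 12.0]
phi_head = [-8.0, -7.9, -8.1, -8.0]
phi_hand = [3.1, 3.0, 3.2, 3.1]

d_theta, d_phi = angle_offsets(theta_head, theta_hand, phi_head, phi_hand)
# Near-constant offsets indicate that the head-hand relationship is fixed while pointing.
print("delta theta:", d_theta, "std:", d_theta.std())
print("delta phi:  ", d_phi, "std:", d_phi.std())
```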
In one embodiment, a human-computer interaction method is provided, which includes:
1. the working principle is as follows, as shown in fig. 3 and 4:
(1) The user stands within a suitable distance from the interactive object (a screen, a robot, or another device). The control device acquires an action image of the user pointing at the interactive object, the action image covering the complete process of the user pointing at the interactive object. After the image is preprocessed, the face and the hand of the user in the action image are identified to obtain a face image and a hand image of the user respectively. The characteristic values of the face image are extracted in a three-dimensional coordinate system established with the control device as the origin, and the horizontal angle and the vertical angle of the face image are calibrated according to the characteristic values of the face image so as to determine the sight direction of the user. The pointing stage features of starting, maintaining and ending are extracted from the hand image, a hidden Markov model HMM is established for each pointing stage feature, and optimal path selection is carried out on all the hidden Markov models HMM so as to determine the finger pointing of the user. The intersection point of the sight direction of the user and the finger pointing is taken as the user input focus, and the interactive object is controlled to execute a response action according to the user input focus so as to realize man-machine interaction.
(2) A gaussian RBF neural network (Generalized Radial Basis Function, GRBF) that is more suitable for addressing highly nonlinear problems is used to calibrate the orientation of a person's face. And extracting relevant characteristic values from the processed face gray level images, inputting the characteristic values into a neural network which is learned and trained in advance, and calibrating the horizontal angle and the vertical angle of the face of the person through comparison and result evaluation so as to determine the sight line direction.
(3) According to the characteristics of finger pointing, a hidden Markov model (Hidden Markov Model, HMM) is adopted to determine the angle of the finger pointing. During interaction, the relevant pointing stage is separated out using the hidden Markov models HMM, and the finger pointing is determined.
(4) And when the sight line direction and the finger direction are determined, taking the intersection point of the sight line direction and the finger direction of the user as a user input focus, and controlling the interactive object to execute a response action according to the user input focus so as to realize man-machine interaction.
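A minimal Python sketch of this step is given below. Two rays in three-dimensional space rarely intersect exactly, so the sketch takes the midpoint of the shortest segment between the gaze ray and the pointing ray as a practical stand-in for the intersection point; the eye and hand positions and directions are hypothetical values in the control-device coordinate system.

```python
import numpy as np

def input_focus(p_eye, d_gaze, p_hand, d_point):
    """Approximate the intersection of the gaze ray and the finger-pointing ray
    by the midpoint of the shortest segment connecting the two rays."""
    d1 = d_gaze / np.linalg.norm(d_gaze)
    d2 = d_point / np.linalg.norm(d_point)
    w0 = p_eye - p_hand
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-9:            # nearly parallel rays: fall back to a projection
        s, t = 0.0, (e / c)
    else:
        s = (b * e - c * d) / denom
        t = (a * e - b * d) / denom
    closest_on_gaze = p_eye + s * d1
    closest_on_point = p_hand + t * d2
    return (closest_on_gaze + closest_on_point) / 2.0

# Hypothetical example: eye 1.6 m above the origin, hand slightly lower and forward.
focus = input_focus(np.array([0.0, 1.6, 0.0]), np.array([0.3, -0.1, 1.0]),
                    np.array([0.2, 1.2, 0.1]), np.array([0.25, 0.0, 1.0]))
print("estimated user input focus:", focus)
```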
2. Sight direction identification
2.1 Facial image acquisition
According to colorimetry theory, stimulus values of captured image colors in RGB space can be calculated by the following formula:
R = ∫φ(λ)·r(λ)dλ, G = ∫φ(λ)·g(λ)dλ, B = ∫φ(λ)·b(λ)dλ.
where φ(λ) represents the relative spectral power distribution of the color light perceived by the human eye, r(λ), g(λ) and b(λ) are the CIE 1964 spectral tristimulus values, and the integration range is the visible-light band, generally 380 nm–780 nm.
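As a rough numerical illustration of these integrals, the following Python sketch approximates the stimulus values with the trapezoidal rule over the 380 nm–780 nm band; the spectral curves used here are placeholder Gaussians standing in for φ(λ) and the CIE matching functions, not real tabulated data.

```python
import numpy as np

# Wavelength grid over the visible band (380-780 nm).
lam = np.linspace(380.0, 780.0, 81)

# Placeholder curves standing in for phi(lambda) and the CIE 1964 matching
# functions; real tabulated values would be loaded here instead.
phi = np.exp(-((lam - 560.0) / 120.0) ** 2)      # relative spectral power
r_bar = np.exp(-((lam - 600.0) / 45.0) ** 2)
g_bar = np.exp(-((lam - 550.0) / 45.0) ** 2)
b_bar = np.exp(-((lam - 450.0) / 45.0) ** 2)

# Stimulus values as integrals of phi times the matching function.
R = np.trapz(phi * r_bar, lam)
G = np.trapz(phi * g_bar, lam)
B = np.trapz(phi * b_bar, lam)
print(f"R={R:.2f}, G={G:.2f}, B={B:.2f}")
```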
The brightness of the acquired image has a great influence on the accuracy of recognition. Therefore, according to the constancy of human eyes in color perception, the influence of brightness in skin color representation is removed, and the component value of skin color is obtained, wherein the calculation formula of the component value is as follows:
r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B).
where R, G and B are the stimulus values of the image colors, r is the red component, g is the green component, and b is the blue component.
The component values of the skin colors are normalized, so that the RGB space is converted into the rg space, in which the skin colors follow a two-dimensional normal distribution.
The face of the user is distinguished from the background image, and grayscale processing is then carried out to obtain a grayscale image. The feature points of the face are located in the grayscale image and the face parameters are extracted to obtain the face image of the user.
2.1.1 Model selection
Since face orientation recognition is a highly nonlinear problem, a Gaussian RBF neural network GRBF is adopted. As shown in fig. 5, the Gaussian RBF neural network GRBF has an m-h-n structure, where m is the input vector dimension, h is the number of hidden layer units, and n is the output vector dimension. X = (x_1, x_2, …, x_m)^T is the input vector set of the network, composed of the signal source nodes.
The hidden layer adopts a nonlinear optimization strategy to adjust the parameters of the activation function. The number of hidden layer elements h depends on the problem described. The radial basis function selects the gaussian function, namely:
φ_i(x_p) = exp(−‖x_p − c_i‖² / (2σ²)), i = 1, 2, …, h.
where c_i is the center of the network hidden layer node, σ is the variance of the basis function, x_p is the p-th input vector, p = 1, 2, 3, …, P, and P is the total number of input vectors.
Since it is difficult to determine a specific angle of the face orientation, the vertical angle and the horizontal angle are divided into intervals. The face image of the user is divided in the vertical direction and the horizontal direction according to the number of nodes; 25 possible standard codes consisting of 5-bit codes can be represented, giving a standard code set Y = [Y_1, Y_2, Y_3, Y_4, Y_5] that represents the horizontal angle and the vertical angle of the face image. Y_1 is the division result of the horizontal angle and the vertical angle of the face image at the first node, Y_2 is the division result at the second node, Y_3 is the division result at the third node, Y_4 is the division result at the fourth node, and Y_5 is the division result at the fifth node. The standard coding scheme is shown in Table 1:
table 1 face orientation standard coding
Characteristic values are extracted from the facial image and normalized to obtain an input vector set X = (x_1, x_2, …, x_m)^T.
The input vector is input into a Gaussian RBF neural network GRBF to obtain an output vector set y= [ y1, y2, y3, y4, y5], and the formula is as follows:
y_j = Σ_{i=1}^{h} w_ij · exp(−‖x_p − c_i‖² / (2b_j²)), j = 1, 2, …, n.
where j = 1, 2, 3, …, n, n is the output vector dimension, y_j is the output vector of the j-th output node of the Gaussian RBF neural network GRBF corresponding to the input vector, b_j is the basis function width of the network hidden layer node, x_p is the p-th input vector, p = 1, 2, 3, …, P, P is the total number of input vectors, i = 1, 2, 3, …, h, h is the number of hidden layer units, w_ij is the connection weight from the hidden layer to the output layer, c_i is the center of the network hidden layer node, and σ is the variance of the basis function.
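A minimal Python sketch of this forward computation is given below; the network sizes, centers and weights are random placeholders rather than trained values.

```python
import numpy as np

def grbf_forward(x, centers, widths, weights):
    """Forward pass of a Gaussian RBF network for one input vector x.

    centers: (h, m) hidden-node centers c_i
    widths:  (h,)   basis-function widths
    weights: (h, n) hidden-to-output connection weights w_ij
    Returns the n-dimensional output vector y."""
    dist2 = np.sum((centers - x) ** 2, axis=1)      # ||x - c_i||^2
    hidden = np.exp(-dist2 / (2.0 * widths ** 2))   # Gaussian activations
    return hidden @ weights                         # y_j = sum_i w_ij * phi_i

# Hypothetical tiny network: m=3 inputs, h=4 hidden units, n=5 outputs.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))
widths = np.full(4, 1.0)
weights = rng.normal(size=(4, 5))

y = grbf_forward(rng.normal(size=3), centers, widths, weights)
print("output vector y:", np.round(y, 3))
```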
Let d be the expected output value of the input vector, the variance σ of the basis function can be expressed as:
σ = (1/P) · Σ_j ‖d_j − y_j·c_i‖².
where c_i is the center of the network hidden layer node, y_j is the output vector of the j-th output node of the Gaussian RBF neural network GRBF corresponding to the input vector, and σ is the variance of the basis function.
2.1.2 Results
Based on the standard code and the output vector, obtaining a confidence coefficient beta, wherein the confidence coefficient beta is used for representing the matching degree between the output vector and the standard code, and the formula is as follows:
β = 1 − (1/n) · Σ_{j=1}^{n} |y_j − Y_j|.
where Y_j is an element in the standard code set, and n and j are positive integers.
If the confidence coefficient beta is larger than or equal to the set threshold alpha, the division result corresponding to the standard code is taken as the horizontal angle and the vertical angle of the user face image; otherwise, the face orientation of the user face image cannot be identified from that standard code.
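The following Python sketch illustrates this matching step. The matching-degree formula used here (one minus the mean absolute difference) and the standard codes are illustrative assumptions, not necessarily the exact definitions used in the method.

```python
import numpy as np

def confidence(y, code):
    """Matching degree between the network output y and a standard code,
    assumed here to be 1 minus the mean absolute difference."""
    y, code = np.asarray(y, dtype=float), np.asarray(code, dtype=float)
    return 1.0 - np.abs(y - code).mean()

def classify_orientation(y, standard_codes, alpha=0.85):
    """Return the index of the best-matching standard code and its confidence,
    or (None, confidence) if no code reaches the threshold alpha."""
    betas = [confidence(y, c) for c in standard_codes]
    best = int(np.argmax(betas))
    return (best, betas[best]) if betas[best] >= alpha else (None, betas[best])

# Hypothetical 5-bit standard codes for three face orientations.
codes = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
y = [0.05, 0.92, 0.10, 0.02, 0.01]
print(classify_orientation(y, codes))   # matches the second code
```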
2.1.3 Gaussian RBF neural network GRBF
The Gaussian RBF neural network GRBF has three parameters to be solved, namely the basis function center c, the variance σ_i, and the weights from the hidden layer to the output layer.
A. Solving the basis function centers c based on the K-means clustering method
① Network initialization. h training samples are randomly chosen as the cluster centers c_i (i = 1, 2, …, h).
② The input training sample set is grouped by the nearest-neighbor rule: each x_p is assigned to a cluster set δ_p (p = 1, 2, …, P) according to the Euclidean distance between x_p and the centers c_i.
③ Readjusting the cluster centers. The average value of the training samples in each cluster set δ_p is calculated as the new cluster center c_i. If the cluster centers no longer change, the obtained c_i are the final basis function centers c of the GRBF neural network; otherwise, return to ② and enter the next round of center solving.
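A compact Python sketch of this clustering procedure, run on a hypothetical two-dimensional training set, is given below.

```python
import numpy as np

def kmeans_centers(samples, h, iters=100, seed=0):
    """Select h basis-function centers from the training samples by K-means.
    samples: (P, m) array of input vectors x_p."""
    rng = np.random.default_rng(seed)
    # 1) Initialization: randomly choose h training samples as cluster centers.
    centers = samples[rng.choice(len(samples), size=h, replace=False)].copy()
    for _ in range(iters):
        # 2) Nearest-neighbor rule: assign each x_p to its closest center.
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Readjust the centers to the mean of each cluster set.
        new_centers = np.array([
            samples[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(h)
        ])
        if np.allclose(new_centers, centers):        # centers no longer change
            break
        centers = new_centers
    return centers

# Hypothetical training set: 40 two-dimensional feature vectors in two groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
print(kmeans_centers(X, h=2))
```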
B. Solving for the variance σ_i
σ_i = c_max / √(2h), i = 1, 2, …, h.
where c_max is the maximum distance between the selected centers, and i = 1, 2, …, h.
C. Calculating the weights w from the hidden layer to the output layer
The connection weight of the neurons from the hidden layer to the output layer is directly calculated by a least square method, and the calculation formula is as follows:
w = exp((h / c_max²) · ‖x_p − c_i‖²), p = 1, 2, …, P; i = 1, 2, …, h.
where p = 1, 2, …, P, P is the total number of input vectors, and i = 1, 2, …, h.
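The following Python sketch illustrates one way to carry out these two steps: the variance is taken as c_max/√(2h) and the hidden-to-output weights are obtained by an ordinary least-squares fit of the hidden-layer activations to the expected outputs. The training data and centers are hypothetical.

```python
import numpy as np

def train_output_layer(samples, targets, centers):
    """Compute the basis-function variance and the hidden-to-output weights.

    The variance follows sigma = c_max / sqrt(2h); the weights are found by a
    least-squares fit of the hidden-layer activations to the expected outputs."""
    h = len(centers)
    # Maximum distance between the selected centers.
    c_max = max(np.linalg.norm(ci - cj) for ci in centers for cj in centers)
    sigma = c_max / np.sqrt(2.0 * h)
    # Hidden-layer activation matrix Phi (P x h).
    dist2 = np.sum((samples[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    phi = np.exp(-dist2 / (2.0 * sigma ** 2))
    # Least-squares solution for the weights W (h x n).
    weights, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return sigma, weights

# Hypothetical data: 10 three-dimensional inputs and 5-dimensional target codes.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
T = rng.normal(size=(10, 5))
C = X[:4]                     # pretend these four centers came from K-means
sigma, W = train_output_layer(X, T, C)
print("sigma:", round(sigma, 3), "weights shape:", W.shape)
```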
2.2 Finger pointing identification
The pointing procedure of the hand is divided into three pointing stage features: starting, maintaining and ending. A hidden Markov model HMM is built for each of the three pointing stage features; their state-set topologies are shown in FIG. 6. Optimal path selection is carried out on all the hidden Markov models HMM to determine the finger pointing angle of the user.
An observation sequence O = O_1O_2O_3…O_t, with O_t = (ω_pan,t, ω_tilt,t), is constructed based on each pointing stage feature of the hand image, where ω_pan,t is the translational angular velocity of the pointing stage feature corresponding to the time point t, and ω_tilt,t is the vertical angular velocity of the pointing stage feature corresponding to the time point t.
A model θ = (A, B, π) is constructed, where π is the initial state probability vector, A is the state transition probability matrix, and B is the observation probability matrix.
Based on the observation sequence O and the model theta, a hidden Markov model HMM is constructed, and the expression P (O|theta) is as follows:
P(O|θ) = π_{q_1} · b_{q_1}(O_1) · a_{q_1 q_2} · b_{q_2}(O_2) · … · a_{q_{t−1} q_t} · b_{q_t}(O_t).
where O is the observation sequence, O_1, O_2, …, O_t are the values in the observation sequence, Q is the best path, q_1, q_2, …, q_t are the values in the best path, b_{q_r}(O_r) are the values in the observation probability matrix, a_{q_{r−1} q_r} are the values in the state transition probability matrix, and r = 1, 2, 3, …, t.
Then, the best path Q = q_1q_2q_3…q_t is selected using the segmental K-means algorithm (Segmental K-means) based on the Viterbi decoding algorithm. The basic idea is to express P(O|θ) with the parameters (A, B, π) of the model θ, let ∂P(O|θ)/∂θ = 0, and take the solution as the re-estimation formula for each parameter. P(O|θ) is maximized according to the re-estimation formula of each parameter, the optimal path is obtained, and the finger pointing angle of the user is determined through the optimal path.
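A compact Python sketch of the Viterbi decoding that underlies this path selection is given below; the three-state model parameters and the quantized observation indices are illustrative assumptions, and the segmental K-means re-estimation loop around it is omitted.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi decoding for a discrete HMM.

    obs: observation indices o_1..o_t (e.g. quantized angular velocities)
    pi:  (N,)   initial state probabilities
    A:   (N, N) state transition probabilities
    B:   (N, M) observation probabilities
    Returns the best state path and its log-probability."""
    N, T = len(pi), len(obs)
    logA, logB = np.log(A + 1e-12), np.log(B + 1e-12)
    delta = np.log(pi + 1e-12) + logB[:, obs[0]]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[i, j]: from state i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Hypothetical 3-state model (starting / maintaining / ending), 3 observation symbols.
pi = np.array([0.90, 0.05, 0.05])
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
B = np.array([[0.7, 0.2, 0.1],     # starting: mostly fast motion
              [0.1, 0.2, 0.7],     # maintaining: mostly still
              [0.6, 0.3, 0.1]])    # ending: motion again
obs = [0, 0, 2, 2, 2, 2, 0]
print(viterbi(obs, pi, A, B))
```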
3. Experimental results and analysis
The computer used in the experiment has a main frequency of 3.0 GHz and 2 GB of memory; the image acquisition device is an ordinary camera, and the size of the acquired image is 352 × 288. After the user image is processed, it is converted into a mouse control signal to control the computer; the speed of the vision processing part in the experiment reaches 15 frames/s. For screen positioning, the computer displays an image of a 10 × 10 table in full screen, and the experimenter points to each cell in the table in turn to test the screen positioning accuracy.
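As a small illustration of how the user input focus is turned into a cell of the displayed table, the following Python sketch maps a focus point given in screen pixels to a row and column of the 10 × 10 grid; the screen resolution used is an assumed value.

```python
def focus_to_cell(x, y, screen_w=1280, screen_h=1024, rows=10, cols=10):
    """Map a user input focus (pixel coordinates on the screen) to the
    row and column of the full-screen rows x cols test table."""
    col = min(int(x / screen_w * cols), cols - 1)
    row = min(int(y / screen_h * rows), rows - 1)
    return row, col

# Hypothetical focus point near the center-right of the screen.
print(focus_to_cell(900.0, 400.0))   # -> (3, 7)
```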
Table 2 lists test data for the screen positioning accuracy of the system under different lighting conditions when the experimenter is 3 m away from the camera.
When the lighting conditions are good, the invention can position the screen and recognize the gesture of the user well. When the lighting conditions are poor, the grayscale of the image is affected, so the accuracy of eye detection and hand detection decreases, which in turn affects the accuracy of the system.
Table 2 screen positioning test
FIG. 7 shows test data of screen positioning and gesture recognition accuracy of the system for different distances from the experimenter to the camera under good light conditions. Due to the limitation of the resolution of the acquired image, when the distance from the user to the camera increases, the accuracy of screen positioning and gesture recognition of the system is somewhat reduced. However, as can be seen from fig. 7, the present invention has great advantages over conventional gaze tracking and gesture recognition techniques in terms of remote human-machine interaction.
Based on the same inventive concept, the invention also provides a human-computer interaction system, which comprises:
The image acquisition module is used for acquiring action images when a user points to the interactive object;
the image recognition module is used for recognizing the face and the hand of the user on the action image to respectively obtain a face image and a hand image of the user;
The face processing module is used for extracting the characteristic value of the face image in a three-dimensional coordinate system established by taking the control equipment as an origin, calibrating the horizontal angle and the vertical angle of the face image according to the characteristic value of the face image, and determining the sight direction of a user;
The hand processing module is used for extracting, from the hand image, the pointing stage features of starting, maintaining and ending, establishing a hidden Markov model HMM for each pointing stage feature, and carrying out optimal path selection on all the hidden Markov models HMM so as to determine the finger pointing of the user;
And the response module is used for taking the intersection point of the sight line direction of the user and the finger direction as a user input focus, and controlling the interactive object to execute response action according to the user input focus so as to realize man-machine interaction.
Furthermore, the invention also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the man-machine interaction method when executing the computer program. The specific implementation method may refer to a method embodiment, and will not be described herein.
Further, the present invention also provides a storage medium on which a computer program is stored, for example a memory containing instructions executable by a processor of a computer device to perform the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. The computer program, when executed by a processor, implements the steps in the embodiments of the man-machine interaction method. The specific implementation may refer to the method embodiments and will not be described herein.
The foregoing examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.