
CN120526469A - Eye movement tracking dynamic target identification method and device based on three-dimensional fixation point

Eye movement tracking dynamic target identification method and device based on three-dimensional fixation point

Info

Publication number
CN120526469A
CN120526469A (application CN202510500113.6A)
Authority
CN
China
Prior art keywords
gaze
target
point
dimensional
foreground
Prior art date
Legal status
Pending
Application number
CN202510500113.6A
Other languages
Chinese (zh)
Inventor
金华
仇跃荣
黄琪辉
宋雪桦
王昌达
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date: 2025-04-21
Filing date: 2025-04-21
Publication date: 2025-08-22
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202510500113.6A
Publication of CN120526469A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the fields of computer vision and deep learning, and specifically relates to an eye-tracking dynamic target identification method and device based on a three-dimensional gaze point. First, three-dimensional gaze points in a fixation state are screened from the three-dimensional gaze-point data and mapped to two-dimensional gaze points in the foreground camera coordinate system; the gazed target in the user's foreground picture is identified from the mapped two-dimensional gaze point, and the result is announced to the user through a result broadcasting device. The target is then tracked so that the recognition operation is not repeated while the user keeps looking at the same target; if the user's gaze leaves the target for a long time, the next recognition is started, ensuring that the target the user is looking at is correctly identified. The invention effectively reduces target loss caused by deformation or high-speed movement, avoids repeatedly uploading the same target for identification, reduces the communication volume between the device and the server, and lowers the load on the server.

Description

Eye movement tracking dynamic target identification method and device based on three-dimensional fixation point
Technical Field
The invention belongs to the field of computer vision and deep learning, and particularly relates to an eye movement tracking dynamic target identification method and device based on a three-dimensional gaze point.
Background Art
Most current recognition algorithms based on three-dimensional fixation points need to run recognition on every frame of the image, which greatly increases the amount of computation, and they often lose the object when it moves quickly. This not only affects recognition accuracy but also severely constrains the real-time performance of the system. In addition, because there is a time lag between the user's fixation points and the target to be identified, recognition that relies on each individual three-dimensional fixation point produces a large number of false positives and reduces the reliability of the system.
Integrating a single-target tracker into eye tracking better simulates the user's attention model in a real scene and more accurately reflects the target the user is gazing at. Applying this to dynamic target recognition matches the user's actual experience, significantly improves the accuracy and real-time performance of target recognition, and is particularly helpful for tracking moving targets more stably in dynamic environments.
Disclosure of Invention
In view of these problems, the invention provides an eye-tracking dynamic identification method and device based on a three-dimensional gaze point, which solve the problem that the identified target does not match the target the user intends to gaze at, and the problem of server congestion caused by repeated upload-and-identify requests.
To achieve this purpose, the specific scheme of the invention is as follows. An eye-tracking dynamic identification method based on a three-dimensional fixation point comprises the following steps:
S1. Acquire the user's three-dimensional gaze-point data with an eye-tracking device, construct a gaze point-time sequence, and screen out the user's high-density three-dimensional gaze points;
S2. Establish a spatial coordinate system based on the eye-tracking device, with the center point of the device's binocular cameras as the origin: the user's horizontal direction is the X axis with the user's right as positive, the vertical direction is the Y axis with up as positive, and the front-back direction is the Z axis with straight ahead as positive;
S3. Let E be the position of the center point between the user's eyes and (x_gaze, y_gaze, z_gaze) the three-dimensional gaze-point data. Substitute both into the device-based spatial coordinate system and establish a line-of-sight equation to obtain the user's gaze line-of-sight function L(x_sight, y_sight, z_sight), whose start point is E and whose end point is the three-dimensional gaze point (x_gaze, y_gaze, z_gaze);
S4. Map the gaze line-of-sight function L(x_sight, y_sight, z_sight) into the foreground camera coordinate system to obtain the two-dimensional coordinates (x', y') of the user's gaze point;
S5. Taking the two-dimensional gaze-point coordinates (x', y') as the center, select a rectangular region of W×H pixels with preset width and height, and establish the initial coordinates of the identification region (x', y', W/2, H/2), where x', y' are the two-dimensional gaze-point coordinates and W/2, H/2 are the distances from the borders of the rectangle to its center; set the image of this W×H rectangular region as the template picture for subsequent tracking;
S6. Use the PyTorch framework to load a weight file trained in advance under the Yolov8 framework, initialize the template-picture recognition model, and use it to extract deep features of the template picture. The deep features include low-level, mid-level, high-level, and global features; the global features aggregate the previous three levels to generate semantically rich context information for subsequent target classification and definition;
S7. Input the extracted deep features of the template picture into the head network of the template-picture recognition model for the classification task; the head network computes the probability distribution over categories and takes the result with the highest confidence as the recognition result of the template picture;
S8. Send the recognition result of the template picture to the eye-tracking device; the device records the received result and compares it with the previous result. If the result has changed, the user is informed of the recognition result; if not, no action is taken;
S9. Initialize a single-target tracker based on the Siamese framework. From the 30 FPS foreground video stream transmitted by the eye-tracking device, extract each foreground video frame and number the frames in ascending chronological order. Input the initial coordinates of the identification region (x', y', W/2, H/2) and the template image into the template branch of the single-target tracker, input the lowest-numbered foreground video frame into the search branch of the tracker, and then delete that frame;
S10. Use the HRnet network to extract features from the template branch and the search branch of the single-target tracker, and perform a depth-wise cross-correlation on the resulting template-branch and search-branch feature maps to obtain a similarity response map;
S11. Generate multiple anchor points (anchors) in the high-confidence region of the similarity response map, and perform a regression task and a classification task for each anchor to produce the tracking result;
S12. Mark the tracking result in the video frame of the foreground image, completing the tracking of the target within one foreground frame; repeat steps S9 to S12 over the foreground video stream to track the target throughout the foreground video;
S13. While tracking the target in the foreground video, monitor whether the user's two-dimensional gaze point (x', y') deviates from the tracking candidate box for a long time. If the gaze point has not entered the candidate-box region for longer than a set time threshold, the user is judged to have lost attention to this target, and the method returns to step S5 to reselect the initial coordinates of the identification region and the template picture; if the gaze point does not deviate beyond the time threshold, the previous result is maintained.
Further, the step S1 includes the following:
S1.1. Organize the three-dimensional gaze-point data into a time sequence according to their timestamps;
S1.2. Use a density clustering algorithm to screen out the high-density gaze points within the period, and combine the time information to establish a set of gaze point-time tuples (x, y, z, t), where x, y, z are the screened high-density three-dimensional gaze points of the user and t is the timestamp.
Further, the step S1.2 includes:
S1.2.1. Define a spatio-temporal distance formula that combines the spatial distance between gaze points with their time difference, where α is the time weight coefficient;
S1.2.2. Process the time sequence with the DBSCAN algorithm. Determine the values of ε and MinPts according to how the three-dimensional gaze points are distributed over time, and compare the number of qualifying points in the neighborhood of each data point with MinPts; ε is the neighborhood radius, which defines the neighborhood range around a point, and MinPts is the minimum number of neighbors; if the number of points around a point is greater than or equal to MinPts, that point is regarded as a core point;
S1.2.3. If a point is confirmed to be a core point and other points exist in its neighborhood, those points belong to the same cluster. Select the clusters whose size exceeds a set count threshold; the points in those clusters are the screened high-density three-dimensional gaze points of the user.
Further, the step S4 includes the following:
S4.1. The foreground camera coordinate system takes the geometric center of the external camera lens mounted on the eye tracker as its origin; the directions of the X, Y, and Z axes are consistent with the eye-tracking device's spatial coordinate system;
S4.2. Transform the user's gaze line-of-sight function L(x_sight, y_sight, z_sight) using the intrinsic parameters of the foreground camera: with a view transformation matrix, convert the gaze line-of-sight function from the eye-tracking device coordinate system into the foreground-camera coordinate system, obtaining the foreground line-of-sight function K(x_fore, y_fore, z_fore), with the conversion formula K(x, y, z) = R*L(x, y, z) + T,
where R is a 3×3 rotation matrix representing the rotation of the eye-tracking device coordinate system relative to the foreground camera coordinate system, and T is a 3×1 translation vector representing the position of the eye-tracking device's origin in the foreground camera coordinate system;
S4.3. In the foreground camera coordinate system, assume the foreground image plane is parallel to the X and Y axes with Z coordinate g. According to the foreground line-of-sight function K(x_fore, y_fore, z_fore), the X and Y coordinates of K at z_fore = g are the two-dimensional coordinates (x', y') of the user's three-dimensional gaze point on the foreground image.
Further, the step S7 includes the following:
S7.1. Pass the extracted deep features to the head network of the template-picture recognition model, and use PAnet to further aggregate and optimize the multi-level features so that the model retains both global and local features;
S7.2. From the aggregated multi-level features, output a confidence score for each category according to the Yolov8 classification loss function;
S7.3. According to the confidence scores of all categories, select the category with the highest confidence as the final recognition result of the template picture.
Further, the step S10 includes the following:
S10.1. The input template branch and search branch use a feature-extraction network structure with shared weights, ensuring that the two branches are matched in the same feature space; the backbone of the high-resolution feature-extraction network HRnet, with identical structure in both branches, extracts features from the template and search images respectively so as to retain more detail;
S10.2. After feature extraction, perform a depth-wise cross-correlation between the template-branch and search-branch feature maps: convolve the two feature maps channel by channel along the channel dimension to obtain a two-dimensional similarity response map, where the value at each position represents the similarity between the template and the corresponding position of the search region.
Further, the step S11 includes the following:
S11.1. Based on the obtained similarity response map, set a threshold as a percentage of the maximum value and mark the region above 80% of the maximum as the high-confidence region. Generate multiple anchors within the high-confidence region, centered on it, each with several different scales and aspect ratios to accommodate the target's shape and size; the anchors serve as reference boxes for target detection and provide the basis for the subsequent regression and classification tasks;
S11.2. Perform a regression task and a classification task for each anchor. The regression task uses the regression loss function to select, among the different scales and aspect ratios, the one with the highest regression score as the anchor box's size and aspect ratio, bringing it closer to the target; the classification branch uses the classification loss function to produce a confidence score for each anchor, representing the probability that the anchor contains the target; the anchor with the highest classification score is found and its candidate box is taken as the tracking result.
The invention further provides an eye-tracking dynamic identification device based on a three-dimensional gaze point, comprising an eye-state detection device, a two-dimensional gaze-point mapping device, a target identification device, a single-target tracking device, a gaze-deviation monitoring device, and a recognition-result broadcasting device. The eye-state detection device screens out the three-dimensional gaze points in a fixation state from the three-dimensional gaze points provided by the eye-tracking equipment, according to fixation duration and gaze-point distribution. The two-dimensional gaze-point mapping device converts the screened three-dimensional gaze points into the corresponding two-dimensional coordinates and maps them onto the foreground image corresponding to their time. The target identification device recognizes the image of the user's gaze region through a local recognition library or an online recognition interface and extracts it as the template picture. The single-target tracking device uses the extracted template picture to track the target in the foreground image. The gaze-deviation monitoring device combines the information of the tracked target with the user's gaze-point information to judge whether the gaze region has deviated from the tracked target for a long time. The recognition-result broadcasting device announces the recognition result to the user by voice and also provides voice prompts for routine operations.
Compared with the prior art, the result obtained by the method better matches the target the user intends to gaze at, and target loss caused by deformation or high-speed movement is effectively reduced; even if the target is temporarily lost, it can be reacquired by combining the user's gaze point. At the same time, the method does not repeatedly upload the same target for identification, which reduces the communication volume with the server and lowers the server load.
Drawings
Fig. 1 is a flow chart of the eye-tracking dynamic target identification method based on a three-dimensional gaze point.
Fig. 2 is a schematic diagram of the structure of the eye-tracking dynamic target identification device based on a three-dimensional gaze point.
FIG. 3 is a flow chart of a single target tracking device.
Detailed Description
The invention will be further described with reference to the drawings and a specific example. It should be noted that only one preferred embodiment and the design principle of the invention are described in detail below, and the scope of the invention is not limited thereto.
The example is a preferred embodiment of the invention, but the invention is not limited to it; any obvious modification, substitution, or variation that a person skilled in the art can make without departing from the spirit of the invention falls within the scope of the invention.
The invention relates to an eye-tracking dynamic identification method based on a three-dimensional fixation point, which, as shown in Fig. 1, comprises the following steps:
S1. Acquire the user's three-dimensional gaze-point data with an eye-tracking device, construct a gaze point-time sequence, and screen out the user's high-density three-dimensional gaze points;
as a preferred embodiment of the present invention, step S1 includes the following:
S1.1. Organize the three-dimensional gaze-point data into a time sequence according to their timestamps;
S1.2. Use a density clustering algorithm to screen out the high-density gaze points within the period, and combine the time information to establish a set of gaze point-time tuples (x, y, z, t), where x, y, z are the screened high-density three-dimensional gaze points of the user and t is the timestamp.
As a preferred embodiment of the present invention, step S1.2 comprises:
S1.2.1. Define a spatio-temporal distance formula that combines the spatial distance between gaze points with their time difference, where α is the time weight coefficient;
S1.2.2. Process the time sequence with the DBSCAN algorithm. Determine the values of ε and MinPts according to how the three-dimensional gaze points are distributed over time, and compare the number of qualifying points in the neighborhood of each data point with MinPts; ε is the neighborhood radius, which defines the neighborhood range around a point, and MinPts is the minimum number of neighbors; if the number of points around a point is greater than or equal to MinPts, that point is regarded as a core point;
S1.2.3. If a point is confirmed to be a core point and other points exist in its neighborhood, those points belong to the same cluster. Select the clusters whose size exceeds a set count threshold; the points in those clusters are the screened high-density three-dimensional gaze points of the user.
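A minimal sketch of the fixation screening in steps S1.2.1 to S1.2.3 is given below. It assumes the spatio-temporal distance simply adds the Euclidean spatial distance to the time difference weighted by α (the exact formula is not reproduced in this text), and it uses scikit-learn's DBSCAN with a precomputed distance matrix; the names alpha, eps, min_pts, and min_cluster_size are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def screen_fixations(points, alpha=0.5, eps=30.0, min_pts=5, min_cluster_size=10):
    """Screen high-density 3D fixation points from a (x, y, z, t) gaze sequence.

    points: (N, 4) array of gaze samples ordered by timestamp t.
    alpha:  time weight coefficient in the spatio-temporal distance
            (assumed form: sqrt(dx^2 + dy^2 + dz^2) + alpha * |dt|).
    """
    xyz, t = points[:, :3], points[:, 3]
    # Pairwise spatio-temporal distance matrix (assumed formula).
    spatial = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    temporal = np.abs(t[:, None] - t[None, :])
    dist = spatial + alpha * temporal

    labels = DBSCAN(eps=eps, min_samples=min_pts, metric="precomputed").fit_predict(dist)

    fixations = []
    for lbl in set(labels) - {-1}:                    # -1 marks noise points
        cluster = points[labels == lbl]
        if len(cluster) >= min_cluster_size:          # keep clusters above the count threshold
            fixations.append(cluster)
    return fixations
```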
S2. Establish a spatial coordinate system based on the eye-tracking device, with the center point of the device's binocular cameras as the origin: the user's horizontal direction is the X axis with the right side as positive, the vertical direction is the Y axis with up as positive, and the front-back direction is the Z axis with straight ahead as positive;
S3. Let E be the position of the center point between the user's eyes and (x_gaze, y_gaze, z_gaze) the three-dimensional gaze-point data. Substitute both into the device-based spatial coordinate system and establish a line-of-sight equation to obtain the user's gaze line-of-sight function L(x_sight, y_sight, z_sight), whose start point is E and whose end point is the three-dimensional gaze point (x_gaze, y_gaze, z_gaze);
S4. Map the gaze line-of-sight function L(x_sight, y_sight, z_sight) into the foreground camera coordinate system to obtain the two-dimensional coordinates (x', y') of the user's gaze point.
As a preferred embodiment of the present invention, step S4 includes the following:
S4.1. The foreground camera coordinate system takes the geometric center of the external camera lens mounted on the eye tracker as its origin; the directions of the X, Y, and Z axes are consistent with the eye-tracking device's spatial coordinate system;
S4.2. Transform the user's gaze line-of-sight function L(x_sight, y_sight, z_sight) using the intrinsic parameters of the foreground camera: with a view transformation matrix, convert the gaze line-of-sight function from the eye-tracking device coordinate system into the foreground-camera coordinate system, obtaining the foreground line-of-sight function K(x_fore, y_fore, z_fore), with the conversion formula K(x, y, z) = R*L(x, y, z) + T,
where R is a 3×3 rotation matrix representing the rotation of the eye-tracking device coordinate system relative to the foreground camera coordinate system, and T is a 3×1 translation vector representing the position of the eye-tracking device's origin in the foreground camera coordinate system;
S4.3. In the foreground camera coordinate system, assume the foreground image plane is parallel to the X and Y axes with Z coordinate g. According to the foreground line-of-sight function K(x_fore, y_fore, z_fore), the X and Y coordinates of K at z_fore = g are the two-dimensional coordinates (x', y') of the user's three-dimensional gaze point on the foreground image.
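The coordinate mapping of steps S4.2 and S4.3 can be sketched as below: the gaze ray is moved into the foreground-camera frame with K = R*L + T and then intersected with the image plane z = g. The calibration values R, T, and g are assumed to be known for the device; the function name is illustrative.

```python
import numpy as np

def gaze_to_foreground_2d(eye_center, gaze_point, R, T, g):
    """Map a 3D gaze ray into the foreground camera frame and intersect it with z = g.

    eye_center, gaze_point: 3-vectors in the eye-tracker coordinate system.
    R (3x3), T (3,):        extrinsics of the eye tracker w.r.t. the foreground camera.
    g:                      depth of the foreground image plane (assumed known from calibration).
    """
    # Transform both ray endpoints: K = R @ L + T
    e = R @ np.asarray(eye_center) + T
    p = R @ np.asarray(gaze_point) + T

    d = p - e                                  # ray direction in the camera frame
    if abs(d[2]) < 1e-9:
        raise ValueError("gaze ray is parallel to the image plane")
    s = (g - e[2]) / d[2]                      # ray parameter where it meets z = g
    hit = e + s * d
    return float(hit[0]), float(hit[1])        # (x', y') on the foreground image plane
```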
S5. Taking the two-dimensional gaze-point coordinates (x', y') as the center, select a rectangular region of W×H pixels with preset width and height, and establish the initial coordinates of the identification region (x', y', W/2, H/2), where x', y' are the two-dimensional gaze-point coordinates and W/2, H/2 are the distances from the borders of the rectangle to its center; set the image of this W×H rectangular region as the template picture for subsequent tracking;
S6. Use the PyTorch framework to load a weight file trained in advance under the Yolov8 framework, initialize the template-picture recognition model, and use it to extract deep features of the template picture. The deep features include low-level, mid-level, high-level, and global features; the global features aggregate the previous three levels to generate semantically rich context information for subsequent target classification and definition.
As a preferred embodiment of the invention, the PyTorch framework is used to load a weight file trained in advance under the Yolov8 framework and initialize the template-picture model. The weight file mainly contains the following four kinds of content:
S6.1. Model architecture information, including the specific layers of the network and the parameters of each layer, such as the number of channels, stride, and convolution kernel size.
S6.2. Loss-function parameters, including the bounding-box regression loss used to predict box coordinates and sizes, the objectness (target confidence) loss used to judge whether an object exists at a given position, and the classification loss responsible for the category classification task.
S6.3. Detection-head parameters used for bounding-box prediction, including the weights of the classification head, the confidence head, and the regression head.
S6.4. Training state information, including training hyperparameters such as the learning rate, batch size, and weight decay.
As a preferred embodiment of the invention, the template picture is fed into the template-picture model for preprocessing: the image is scaled proportionally, the shorter side is padded with black to reach a size of 640×640, the enlarged picture is normalized, and the channel order is adjusted to (3, 640, 640) by dimension transposition to match the model's input requirements. The preprocessed image is input into the template-picture recognition model, and deep features are extracted through its backbone network, including low-level, mid-level, high-level, and global features. The low-level features have higher resolution and retain rich edge and texture information. The mid-level features extract the contour, shape, and local structural information of the object. The high-level features identify the object's category and background information. The global features aggregate the previous three levels to generate semantically rich context information for subsequent target classification and definition.
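A sketch of this preprocessing, under the assumption that the black fill on the shorter side is a standard letterbox pad to 640×640, might look like the following (OpenCV and PyTorch; the function and variable names are illustrative):

```python
import cv2
import numpy as np
import torch

def preprocess_template(img_bgr, size=640):
    """Letterbox a template crop to a (1, 3, size, size) tensor for the recognition model."""
    h, w = img_bgr.shape[:2]
    scale = size / max(h, w)                           # proportional (aspect-preserving) resize
    resized = cv2.resize(img_bgr, (int(round(w * scale)), int(round(h * scale))))

    canvas = np.zeros((size, size, 3), dtype=np.uint8)  # black fill on the shorter side
    nh, nw = resized.shape[:2]
    canvas[:nh, :nw] = resized

    x = canvas.astype(np.float32) / 255.0               # normalize to [0, 1]
    x = np.transpose(x, (2, 0, 1))                      # HWC -> CHW, i.e. (3, 640, 640)
    return torch.from_numpy(x).unsqueeze(0)             # add a batch dimension
```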
S7. Input the extracted deep features of the template picture into the head network of the template-picture recognition model for the classification task; the head network computes the probability distribution over categories and takes the result with the highest confidence as the recognition result of the template picture.
As a preferred embodiment of the present invention, step S7 includes the following:
S7.1. Pass the extracted deep features to the head network of the template-picture recognition model, and use PAnet to further aggregate and optimize the multi-level features so that the model retains both global and local features;
S7.2. From the aggregated multi-level features, output a confidence score for each category according to the Yolov8 classification loss function;
S7.3. According to the confidence scores of all categories, select the category with the highest confidence as the final recognition result of the template picture.
S8. Send the recognition result of the template picture to the eye-tracking device; the device records the received result and compares it with the previous result. If the result has changed, the user is informed of the recognition result; if not, no action is taken;
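The compare-and-notify logic of step S8 can be sketched as a small stateful helper; the callback name announce is an assumption standing in for the device's voice broadcast.

```python
class RecognitionNotifier:
    """Report a recognition result only when it differs from the last one (step S8)."""

    def __init__(self, announce):
        self.announce = announce      # callback, e.g. a text-to-speech broadcast function
        self.last_result = None

    def receive(self, result):
        if result != self.last_result:
            self.announce(result)     # result changed: inform the user
        self.last_result = result     # record the received result either way


# Usage sketch: announce via print; the real device would use voice broadcast.
notifier = RecognitionNotifier(announce=print)
notifier.receive("cup")    # announced
notifier.receive("cup")    # silent, unchanged
notifier.receive("phone")  # announced
```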
S9. Initialize a single-target tracker based on the Siamese framework. From the 30 FPS foreground video stream transmitted by the eye-tracking device, extract each foreground video frame and number the frames in ascending chronological order. Input the initial coordinates of the identification region (x', y', W/2, H/2) and the template image into the template branch of the single-target tracker, input the lowest-numbered foreground video frame into the search branch of the tracker, and then delete that frame;
S10. Use the HRnet network to extract features from the template branch and the search branch of the single-target tracker, and perform a depth-wise cross-correlation on the resulting template-branch and search-branch feature maps to obtain a similarity response map.
As a preferred embodiment of the present invention, step S10 includes the following:
S10.1. The input template branch and search branch use a feature-extraction network structure with shared weights, ensuring that the two branches are matched in the same feature space; the backbone of the high-resolution feature-extraction network HRnet, with identical structure in both branches, extracts features from the template and search images respectively so as to retain more detail;
S10.2. After feature extraction, perform a depth-wise cross-correlation between the template-branch and search-branch feature maps: convolve the two feature maps channel by channel along the channel dimension to obtain a two-dimensional similarity response map, where the value at each position represents the similarity between the template and the corresponding position of the search region.
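The channel-by-channel cross-correlation of step S10.2 is commonly implemented with a grouped convolution; a sketch under that assumption follows, with illustrative feature-map sizes. Collapsing the per-channel responses (here by summation) yields the 2D similarity response map described above.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Channel-wise cross-correlation between search and template feature maps.

    search_feat:   (B, C, Hs, Ws) features of the search branch.
    template_feat: (B, C, Ht, Wt) features of the template branch (used as the kernel).
    Returns a (B, C, Hs-Ht+1, Ws-Wt+1) per-channel response.
    """
    b, c, h, w = search_feat.shape
    x = search_feat.reshape(1, b * c, h, w)                    # fold batch into channels
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(x, kernel, groups=b * c)                    # per-channel correlation
    return out.reshape(b, c, out.shape[2], out.shape[3])

# Usage sketch with illustrative sizes.
resp = depthwise_xcorr(torch.randn(1, 256, 31, 31), torch.randn(1, 256, 7, 7))
similarity_map = resp.sum(dim=1)    # collapse channels into a 2D similarity response map
```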
S11. Generate multiple anchor points (anchors) in the high-confidence region of the similarity response map, and perform a regression task and a classification task for each anchor to produce the tracking result;
As a preferred embodiment of the present invention, step S11 includes the following:
S11.1. Based on the obtained similarity response map, set a threshold as a percentage of the maximum value and mark the region above 80% of the maximum as the high-confidence region. Generate multiple anchors within the high-confidence region, centered on it, each with several different scales and aspect ratios to accommodate the target's shape and size; the anchors serve as reference boxes for target detection and provide the basis for the subsequent regression and classification tasks;
S11.2. Perform a regression task and a classification task for each anchor. The regression task uses the regression loss function to select, among the different scales and aspect ratios, the one with the highest regression score as the anchor box's size and aspect ratio, bringing it closer to the target; the classification branch uses the classification loss function to produce a confidence score for each anchor, representing the probability that the anchor contains the target; the anchor with the highest classification score is found and its candidate box is taken as the tracking result.
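A sketch of the selection logic in steps S11.1 and S11.2, assuming the classification scores and regressed boxes are already available as tensors (their exact shapes here are illustrative):

```python
import torch

def select_best_anchor(similarity_map, cls_scores, boxes, ratio=0.8):
    """Pick the tracking result among anchors placed on the high-confidence region.

    similarity_map: (H, W) 2D similarity response map.
    cls_scores:     (H, W, A) classification confidence for each position and anchor.
    boxes:          (H, W, A, 4) regressed anchor boxes as (cx, cy, w, h).
    ratio:          positions above ratio * max(response) form the high-confidence region.
    """
    high_conf = similarity_map >= ratio * similarity_map.max()   # mark the 80%-of-max region
    scores = cls_scores.clone()
    scores[~high_conf] = float("-inf")                           # ignore anchors outside the region

    h, w, a = scores.shape
    flat = int(torch.argmax(scores))
    y, rest = divmod(flat, w * a)                                # recover (y, x, anchor) indices
    x, k = divmod(rest, a)
    return boxes[y, x, k]                                        # candidate box = tracking result
```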
S12. Mark the tracking result in the video frame of the foreground image, completing the tracking of the target within one foreground frame; repeat steps S9 to S12 over the foreground video stream to track the target throughout the foreground video;
S13. While tracking the target in the foreground video, monitor whether the user's two-dimensional gaze point (x', y') deviates from the tracking candidate box for a long time. If the gaze point has not entered the candidate-box region for longer than a set time threshold, the user is judged to have lost attention to this target, and the method returns to step S5 to reselect the initial coordinates of the identification region and the template picture; if the gaze point does not deviate beyond the time threshold, the previous result is maintained.
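The deviation check of step S13 can be sketched as follows; the two-second threshold is an assumed value, and the box follows the (x', y', W/2, H/2) convention from step S5.

```python
import time

class GazeDeviationMonitor:
    """Decide when the user's gaze has left the tracked box for too long (step S13)."""

    def __init__(self, time_threshold=2.0):
        self.time_threshold = time_threshold   # seconds of sustained deviation (assumed value)
        self.deviation_start = None

    def update(self, gaze_xy, box):
        """box = (cx, cy, half_w, half_h), matching the (x', y', W/2, H/2) convention."""
        cx, cy, hw, hh = box
        inside = abs(gaze_xy[0] - cx) <= hw and abs(gaze_xy[1] - cy) <= hh

        if inside:
            self.deviation_start = None        # gaze back on target: reset the timer
            return False                       # keep the current target and result
        if self.deviation_start is None:
            self.deviation_start = time.monotonic()
        # True means attention is lost: return to S5 and pick a new template
        return time.monotonic() - self.deviation_start > self.time_threshold
```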
The invention further provides an eye-tracking dynamic identification device based on a three-dimensional gaze point, comprising an eye-state detection device, a two-dimensional gaze-point mapping device, a target identification device, a single-target tracking device, a gaze-deviation monitoring device, and a recognition-result broadcasting device. The eye-state detection device screens out the three-dimensional gaze points in a fixation state from the three-dimensional gaze points provided by the eye-tracking equipment, according to fixation duration and gaze-point distribution. The two-dimensional gaze-point mapping device converts the screened three-dimensional gaze points into the corresponding two-dimensional coordinates and maps them onto the foreground image corresponding to their time. The target identification device recognizes the image of the user's gaze region through a local recognition library or an online recognition interface and extracts it as the template picture. The single-target tracking device uses the extracted template picture to track the target in the foreground image. The gaze-deviation monitoring device combines the information of the tracked target with the user's gaze-point information to judge whether the gaze region has deviated from the tracked target for a long time. The recognition-result broadcasting device announces the recognition result to the user by voice and also provides voice prompts for routine operations.
As a preferred embodiment of the present invention, as shown in Fig. 3, the implementation of the single-target tracking device of the invention includes the following:
1) The improved Siamese model is loaded, the initial coordinates of the identification region (x', y', W/2, H/2) and the template image are input into the template branch of the single-target tracker, the lowest-numbered foreground video frame is input into the search branch, and that frame is then deleted.
2) The integrated HRnet network extracts features from the template image and the search image respectively, so as to retain more high-resolution detail.
3) A depth-wise cross-correlation is performed between the resulting template-branch and search-branch feature maps to generate a two-dimensional similarity response map, where the value at each position represents the similarity between the template and the corresponding position of the search region.
4) Based on the similarity response map, a high-confidence region is screened out according to the set threshold, and multiple anchors are generated within it for subsequent target localization.
5) A regression task and a classification task are performed for each anchor: the regression task determines the size and aspect ratio of the anchor box, and the classification task finds the highest-scoring anchor and takes its candidate box as the tracking result.
6) The final result is output, the next lowest-numbered foreground video frame is input into the search branch of the single-target tracker, and steps 1) to 6) are repeated.

Claims (8)

1. An eye-tracking dynamic target identification method based on a three-dimensional gaze point, characterized by comprising the following steps:
S1. acquiring the user's three-dimensional gaze-point data with an eye-tracking device, constructing a gaze point-time sequence, and screening out the user's high-density three-dimensional gaze points;
S2. establishing a spatial coordinate system based on the eye-tracking device, with the center point of the device's binocular cameras as the origin: the user's horizontal direction is the X axis with the right side as positive, the vertical direction is the Y axis with up as positive, and the front-back direction is the Z axis with straight ahead as positive;
S3. letting E be the position of the center point between the user's eyes and (x_gaze, y_gaze, z_gaze) the three-dimensional gaze-point data, substituting both into the device-based spatial coordinate system and establishing a line-of-sight equation to obtain the user's gaze line-of-sight function L(x_sight, y_sight, z_sight), whose start point is E and whose end point is the three-dimensional gaze point (x_gaze, y_gaze, z_gaze);
S4. mapping the gaze line-of-sight function L(x_sight, y_sight, z_sight) into the foreground camera coordinate system to obtain the two-dimensional coordinates (x', y') of the user's gaze point;
S5. taking the two-dimensional gaze-point coordinates (x', y') as the center, selecting a rectangular region of W×H pixels with preset width and height, and establishing the initial coordinates of the identification region (x', y', W/2, H/2), where x', y' are the two-dimensional gaze-point coordinates and W/2, H/2 are the distances from the borders of the rectangle to its center, and setting the image of this W×H rectangular region as the template picture for subsequent tracking;
S6. using the PyTorch framework to load a weight file trained in advance under the Yolov8 framework, initializing the template-picture recognition model, and using it to extract deep features of the template picture, the deep features including low-level, mid-level, high-level, and global features, where the global features aggregate the previous three levels to generate semantically rich context information for subsequent target classification and definition;
S7. inputting the extracted deep features of the template picture into the head network of the template-picture recognition model for the classification task; the head network computes the probability distribution over categories and takes the result with the highest confidence as the recognition result of the template picture;
S8. sending the recognition result of the template picture to the eye-tracking device, which records the received result and compares it with the previous result; if the result has changed, the user is informed of the recognition result, otherwise no action is taken;
S9. initializing a single-target tracker based on the Siamese framework; from the 30 FPS foreground video stream transmitted by the eye-tracking device, extracting each foreground video frame and numbering the frames in ascending chronological order; inputting the initial coordinates of the identification region (x', y', W/2, H/2) and the template image into the template branch of the single-target tracker, inputting the lowest-numbered foreground video frame into the search branch, and deleting that frame;
S10. using the HRnet network to extract features from the template branch and the search branch of the single-target tracker, and performing a depth-wise cross-correlation on the resulting template-branch and search-branch feature maps to obtain a similarity response map;
S11. generating multiple anchor points (anchors) in the high-confidence region of the similarity response map, and performing a regression task and a classification task for each anchor to produce the tracking result;
S12. marking the tracking result in the video frame of the foreground image, completing the tracking of the target within one foreground frame, and repeating steps S9 to S12 over the foreground video stream to track the target throughout the foreground video;
S13. while tracking the target in the foreground video, monitoring whether the user's two-dimensional gaze point (x', y') deviates from the tracking candidate box for a long time; if the gaze point has not entered the candidate-box region for longer than a set time threshold, the user is judged to have lost attention to this target and the method returns to S5 to reselect the initial coordinates of the identification region and the template picture; if the gaze point does not deviate beyond the time threshold, the previous result is maintained.
2. The eye-tracking dynamic target identification method based on a three-dimensional gaze point according to claim 1, characterized in that step S1 comprises:
S1.1. organizing the three-dimensional gaze-point data into a time sequence according to their timestamps;
S1.2. using a density clustering algorithm to screen out the high-density gaze points within the period, and combining the time information to establish a set of gaze point-time tuples (x, y, z, t), where x, y, z are the screened high-density three-dimensional gaze points of the user and t is the timestamp.
3. The eye-tracking dynamic target identification method based on a three-dimensional gaze point according to claim 2, characterized in that step S1.2 comprises:
S1.2.1. defining a spatio-temporal distance formula, where α is the time weight coefficient;
S1.2.2. processing the time sequence with the DBSCAN algorithm, determining the values of ε and MinPts according to how the three-dimensional gaze points are distributed over time, and comparing the number of qualifying points in the neighborhood of each data point with MinPts, where ε is the neighborhood radius defining the neighborhood range around a point and MinPts is the minimum number of neighbors; if the number of points around a point is greater than or equal to MinPts, that point is regarded as a core point;
S1.2.3. if a point is confirmed to be a core point and other points exist in its neighborhood, those points belong to the same cluster; selecting the clusters whose size exceeds a set count threshold, the points in which are the screened high-density three-dimensional gaze points of the user.
4. The eye-tracking dynamic target identification method based on a three-dimensional gaze point according to claim 1, characterized in that step S4 comprises:
S4.1. the foreground camera coordinate system takes the geometric center of the external camera lens mounted on the eye tracker as its origin, and the directions of the X, Y, and Z axes are consistent with the eye-tracking device's spatial coordinate system;
S4.2. transforming the user's gaze line-of-sight function L(x_sight, y_sight, z_sight) with the intrinsic parameters of the foreground camera: using a view transformation matrix, converting the gaze line-of-sight function from the eye-tracking device coordinate system into the foreground-camera coordinate system to obtain the foreground line-of-sight function K(x_fore, y_fore, z_fore), with the conversion formula K(x, y, z) = R*L(x, y, z) + T, where R is a 3×3 rotation matrix representing the rotation of the eye-tracking device coordinate system relative to the foreground camera coordinate system and T is a 3×1 translation vector representing the position of the eye-tracking device's origin in the foreground camera coordinate system;
S4.3. in the foreground camera coordinate system, assuming the foreground image plane is parallel to the X and Y axes with Z coordinate g, the X and Y coordinates of the foreground line-of-sight function K(x_fore, y_fore, z_fore) at z_fore = g are the two-dimensional coordinates (x', y') of the user's three-dimensional gaze point on the foreground image.
5. The eye-tracking dynamic target identification method based on a three-dimensional gaze point according to claim 1, characterized in that step S7 comprises:
S7.1. passing the extracted deep features to the head network of the template-picture recognition model, and using PAnet to further aggregate and optimize the multi-level features so that the model retains both global and local features;
S7.2. from the aggregated multi-level features, outputting a confidence score for each category according to the Yolov8 classification loss function;
S7.3. according to the confidence scores of all categories, selecting the category with the highest confidence as the final recognition result of the template picture.
6. The eye-tracking dynamic target identification method based on a three-dimensional gaze point according to claim 1, characterized in that step S10 comprises:
S10.1. the input template branch and search branch use a feature-extraction network structure with shared weights, ensuring that the two branches are matched in the same feature space; the backbone of the high-resolution feature-extraction network HRnet, with identical structure in both branches, extracts features from the template and search images respectively so as to retain more detail;
S10.2. after feature extraction, performing a depth-wise cross-correlation between the template-branch and search-branch feature maps, that is, convolving the two feature maps channel by channel along the channel dimension to obtain a two-dimensional similarity response map, where the value at each position represents the similarity between the template and the corresponding position of the search region.
7. The eye-tracking dynamic target identification method based on a three-dimensional gaze point according to claim 1, characterized in that step S11 comprises:
S11.1. based on the obtained similarity response map, setting a threshold as a percentage of the maximum value and marking the region above 80% of the maximum as the high-confidence region; generating multiple anchors within the high-confidence region, centered on it, each with several different scales and aspect ratios to accommodate the target's shape and size, the anchors serving as reference boxes for target detection and providing the basis for the subsequent regression and classification tasks;
S11.2. performing a regression task and a classification task for each anchor, where the regression task uses the regression loss function to select, among the different scales and aspect ratios, the one with the highest regression score as the anchor box's size and aspect ratio, bringing it closer to the target, and the classification branch uses the classification loss function to produce a confidence score for each anchor representing the probability that it contains the target; the anchor with the highest classification score is found and its candidate box is taken as the tracking result.
8. An eye-tracking dynamic target identification device based on a three-dimensional gaze point, characterized by comprising an eye-state detection device, a two-dimensional gaze-point mapping device, a target identification device, a single-target tracking device, a gaze-deviation monitoring device, and a recognition-result broadcasting device; the eye-state detection device screens out the three-dimensional gaze points in a fixation state from the three-dimensional gaze points provided by the eye-tracking equipment according to fixation duration and gaze-point distribution; the two-dimensional gaze-point mapping device converts the screened three-dimensional gaze points into the corresponding two-dimensional coordinates and maps them onto the foreground image corresponding to their time; the target identification device recognizes the image of the user's gaze region through a local recognition library or an online recognition interface and extracts it as the template picture; the single-target tracking device uses the extracted template picture to track the target in the foreground image; the gaze-deviation monitoring device combines the information of the tracked target with the user's gaze-point information to judge whether the gaze region has deviated from the tracked target for a long time; the recognition-result broadcasting device announces the recognition result to the user by voice and provides voice prompts for routine operations.
CN202510500113.6A 2025-04-21 2025-04-21 Eye movement tracking dynamic target identification method and device based on three-dimensional fixation point Pending CN120526469A (en)

Priority Applications (1)

Application Number: CN202510500113.6A; Priority Date: 2025-04-21; Filing Date: 2025-04-21; Title: Eye movement tracking dynamic target identification method and device based on three-dimensional fixation point

Publications (1)

Publication Number: CN120526469A; Publication Date: 2025-08-22

Family

ID: 96746201

Country Status (1)

Country: CN; Link: CN120526469A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination