CN120526469A - Eye movement tracking dynamic target identification method and device based on three-dimensional fixation point - Google Patents
- Publication number
- CN120526469A (application CN202510500113.6A)
- Authority
- CN
- China
- Prior art keywords
- gaze
- target
- point
- dimensional
- foreground
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
The invention belongs to the field of computer vision and deep learning, and in particular relates to an eye tracking dynamic target identification method and device based on a three-dimensional gaze point. First, three-dimensional gaze points in the fixation state are screened out from the three-dimensional gaze point data and mapped to two-dimensional gaze points in the foreground camera coordinate system; the gaze target in the user's foreground picture is identified from the mapped two-dimensional gaze point, and the user is informed of the identification result through a result broadcasting device. The target is then tracked, so that recognition is not repeated while the user's line of sight remains on the target; if the user's line of sight deviates from the target for a long time, the next recognition is started, which ensures that the target the user is looking at is correctly recognized. The invention effectively reduces target loss caused by deformation or high-speed movement of the target, avoids repeatedly uploading the same target for identification, reduces the communication volume with the server, and reduces the load on the server.
    Description
Technical Field
      The invention belongs to the field of computer vision and deep learning, and particularly relates to an eye movement tracking dynamic target identification method and device based on a three-dimensional gaze point.
      Background Art
      Most current recognition algorithms based on three-dimensional gaze points must run recognition on every image frame, which greatly increases the amount of computation, and they often lose the target when it moves quickly. This not only lowers recognition accuracy but also severely limits the real-time performance of the system. In addition, because there is a time lag between the user's gaze point and the target to be identified, recognition that relies solely on each individual three-dimensional gaze point produces a large number of false positives and reduces the reliability of the system.
      Integrating a single-target tracker into eye tracking better models the user's attention in a real scene and accurately reflects the user's gaze target. Applying this approach to dynamic target recognition matches the user's actual experience, significantly improves the accuracy and real-time performance of target recognition, and is particularly helpful for recognizing and tracking moving targets more stably in a dynamic environment.
    Disclosure of Invention
      In view of the above problems, the invention provides an eye tracking dynamic identification method and device based on a three-dimensional gaze point, which address the mismatch between the target the user intends to look at and the identified target, as well as the server congestion caused by repeated uploads for identification.
      To achieve this purpose, the technical solution of the invention is an eye tracking dynamic identification method based on a three-dimensional gaze point, comprising the following steps:
       S1, acquiring three-dimensional gaze point data of a user with an eye tracking device, constructing a gaze point-time sequence, and screening out the user's high-density three-dimensional gaze points;
       S2, establishing a spatial coordinate system based on the eye tracking device, in which the center point of the binocular camera of the user's eye tracking device is the origin, the user's horizontal direction is the X axis with the user's right as the positive X direction, the vertical direction is the Y axis with upward as the positive Y direction, and the front-back direction is the Z axis with forward as the positive Z direction;
       S3, denoting the position of the center point of the user's eyes as E and the three-dimensional gaze point data as $(x_{\text{gaze}}, y_{\text{gaze}}, z_{\text{gaze}})$, substituting both into the spatial coordinate system based on the eye tracking device, and establishing a line-of-sight equation to obtain the user's gaze line-of-sight function $L(x_{\text{sight}}, y_{\text{sight}}, z_{\text{sight}})$, whose starting point is the position E and whose end point is the three-dimensional gaze point $(x_{\text{gaze}}, y_{\text{gaze}}, z_{\text{gaze}})$;
       S4, mapping the gaze line-of-sight function $L(x_{\text{sight}}, y_{\text{sight}}, z_{\text{sight}})$ into the foreground camera coordinate system to obtain the user's two-dimensional gaze point coordinates (x', y');
       S5, selecting a rectangular area of W × H pixels centered on the two-dimensional gaze point coordinates (x', y'), and establishing the initial coordinates of the identification area (x', y', W/2, H/2), where x', y' are the two-dimensional gaze point coordinates and W/2, H/2 are the distances from the border of the rectangular area to its center, and taking the image of this W × H rectangular area as the template picture for subsequent tracking;
       S6, using the PyTorch framework to load a weight file trained in advance under the YOLOv8 framework, initializing a template picture recognition model, and extracting deep features of the template picture with this model, where the deep features comprise low-level features, mid-level features, high-level features and global features, and the global features aggregate the preceding three to generate semantically rich context information for subsequent target classification and definition;
       S7, feeding the extracted deep features of the template picture into the head network of the template picture recognition model for the classification task, the head network computing the probability distribution over the categories and taking the result with the highest confidence as the recognition result of the template picture;
       S8, sending the recognition result of the template picture to the eye tracking device, which records the received result and compares it with the previous result; if the result has changed, the user is informed of the recognition result, and if it has not changed, no operation is performed;
       S9, initializing a single-target tracker based on the Siamese framework, extracting each foreground video frame from the 30 FPS foreground video stream transmitted by the eye tracking device, and numbering the frames in ascending chronological order; inputting the initial coordinates (x', y', W/2, H/2) of the identification area and the template picture into the template branch of the single-target tracker, inputting the lowest-numbered foreground video frame into the search branch of the single-target tracker, and then deleting that frame;
       S10, extracting features from the template branch and the search branch of the single-target tracker with an HRNet network, and performing a depthwise cross-correlation between the resulting template-branch feature map and search-branch feature map to obtain a similarity response map;
       S11, generating a plurality of anchors in the higher-confidence region of the similarity response map, and performing a regression task and a classification task on each anchor to generate a tracking result;
       S12, marking the tracking result in the foreground video frame, which completes the tracking of the target in one foreground frame, and repeating steps S9 to S12 over the foreground video stream to track the target throughout the foreground video;
       S13, while the target is being tracked in the foreground video, monitoring whether the user's two-dimensional gaze point (x', y') stays outside the tracking candidate box for a long time; if the gaze point has not entered the candidate box region for longer than a set time threshold, the user is judged to have lost attention to the target, and the method returns to step S5 to reselect the initial coordinates of the identification area and the template picture; if the gaze point does not stay outside the candidate box for longer than the time threshold, the previous result is maintained.
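      As a non-limiting illustration, the gaze deviation check of step S13 could be sketched as follows. The class name `GazeDeviationMonitor`, the helper `gaze_in_box` and the threshold value are hypothetical; the box layout (center x, center y, half width, half height) follows the identification area coordinates of step S5.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, half_w, half_h), as in step S5


def gaze_in_box(gaze: Tuple[float, float], box: Box) -> bool:
    """Return True when the 2D gaze point lies inside the candidate box."""
    gx, gy = gaze
    cx, cy, hw, hh = box
    return abs(gx - cx) <= hw and abs(gy - cy) <= hh


@dataclass
class GazeDeviationMonitor:
    # Hypothetical parameter: the text only speaks of "a set time threshold".
    time_threshold_s: float
    last_inside_t: Optional[float] = None

    def update(self, t: float, gaze: Tuple[float, float], box: Box) -> bool:
        """Return True when the target should be reselected (back to S5)."""
        if self.last_inside_t is None or gaze_in_box(gaze, box):
            # Start or restart the timer whenever the gaze is on the target.
            self.last_inside_t = t
            return False
        return (t - self.last_inside_t) > self.time_threshold_s
```

      A caller would invoke `update` once per tracked frame and, when it returns True, restart from step S5 with a new monitor.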
      Further, step S1 comprises:
       S1.1, organizing the three-dimensional gaze point data into a time series by timestamp;
       S1.2, screening out the high-density gaze points within a period using a density clustering algorithm and, combined with the time information, establishing a set of gaze point-time sequences (x, y, z, t), where x, y, z denote a screened high-density three-dimensional gaze point of the user and t denotes its timestamp.
      Further, step S1.2 comprises:
       S1.2.1, defining a spatio-temporal distance calculation formula,
       where α denotes the time weight coefficient;
       S1.2.2, processing the time series with the DBSCAN algorithm: the values of ε and MinPts are determined from how the three-dimensional gaze points are distributed over time, and the number of points satisfying the condition within the neighborhood of each data point is compared with MinPts, where ε is the neighborhood radius defining the neighborhood around a point and MinPts is the minimum number of neighboring points; if the number of points around a point is greater than or equal to MinPts, the point is regarded as a core point;
       S1.2.3, if a point is confirmed to be a core point and other points lie within its neighborhood, those points belong to the same cluster; clusters whose size exceeds a set threshold are selected, and the points in these clusters are the screened high-density three-dimensional gaze points of the user.
      Further, step S4 comprises:
       S4.1, the foreground camera coordinate system takes the geometric center of the lens of the external camera mounted on the eye tracker as its origin, and the directions of its X, Y and Z axes are consistent with the spatial coordinate system of the eye tracking device;
       S4.2, using the user's gaze line-of-sight function $L(x_{\text{sight}}, y_{\text{sight}}, z_{\text{sight}})$ and the intrinsic parameters of the foreground camera, converting the gaze line-of-sight function in the eye tracking device coordinate system into the foreground line-of-sight function $K(x_{\text{fg}}, y_{\text{fg}}, z_{\text{fg}})$ in the foreground camera coordinate system by means of a view transformation matrix, the conversion formula being $K(x, y, z) = R \cdot L(x, y, z) + T$,
      where R is a 3×3 rotation matrix representing the rotation of the eye tracking device coordinate system relative to the foreground camera coordinate system, and T is a 3×1 translation vector representing the position of the origin of the eye tracking device in the foreground camera coordinate system;
       S4.3, assuming that, in the foreground camera coordinate system, the foreground image plane is parallel to the X and Y axes and has Z-axis coordinate g, the X-axis and Y-axis coordinates of the foreground line-of-sight function $K(x_{\text{fg}}, y_{\text{fg}}, z_{\text{fg}})$ at $z_{\text{fg}} = g$ are taken as the two-dimensional gaze point coordinates (x', y') of the user's three-dimensional gaze point on the foreground image.
      Further, step S7 comprises:
       S7.1, passing the extracted deep features to the head network of the template picture recognition model, and using PANet to further aggregate and optimize the multi-level features so that the model has both global and local features;
       S7.2, outputting a confidence score for each category from the aggregated multi-level features according to the YOLOv8 classification loss function;
       S7.3, selecting, from the confidence scores of all categories, the category with the highest confidence as the final template picture recognition result.
      Further, step S10 comprises:
       S10.1, the template branch and the search branch input to the system use a weight-sharing feature extraction network, so that the two branches are matched in the same feature space; backbone networks of the high-resolution feature extraction network HRNet with identical structure extract features from the template-branch image and the search-branch image respectively, so as to retain more detail;
       S10.2, after feature extraction, performing a depthwise cross-correlation on the resulting template-branch feature map and search-branch feature map, that is, convolving the template-branch feature map with the search-branch feature map channel by channel along the channel dimension, to obtain a two-dimensional similarity response map in which the value at each position represents the similarity between the template and the corresponding position of the search area.
      Further, step S11 comprises:
       S11.1, based on the obtained similarity response map, setting a threshold as a percentage of the maximum value and marking the region whose response exceeds 80% of the maximum as the high-confidence region; generating a plurality of anchors in the high-confidence region of the similarity response map, the anchors being centered on the high-confidence region and each being given several different scales and aspect ratios to fit the shape and size of the target; the anchors serve as reference boxes for target detection and provide the basis for the subsequent regression and classification tasks;
       S11.2, performing a regression task and a classification task on each anchor: the regression task uses the regression loss function to select, among the different scales and aspect ratios, the one with the highest regression score as the size and aspect ratio of the anchor box, bringing the anchor box closer to the target; the classification branch uses the classification loss function to generate a confidence score for each anchor, representing the probability that the anchor contains the target; the anchor with the highest classification score is found and its candidate box is taken as the tracking result.
      The invention further provides an eye tracking dynamic identification device based on a three-dimensional gaze point, comprising an eye state detection device, a two-dimensional gaze point mapping device, a target identification device, a single-target tracking device, a gaze deviation monitoring device and an identification result broadcasting device. The eye state detection device screens out the three-dimensional gaze points in the fixation state from the three-dimensional gaze points provided by the eye tracking device, according to gaze duration and gaze point distribution. The two-dimensional gaze point mapping device converts the screened three-dimensional gaze points into the corresponding two-dimensional coordinates and maps them onto the corresponding foreground image. The target identification device identifies the image of the user's gaze point region, identifying the target through a local identification library or an online identification interface and extracting the target as the template picture. The single-target tracking device tracks the target. The gaze deviation monitoring device combines the information of the tracked target with the user's gaze point information and judges whether the gaze region has deviated from the target for a long time. The identification result broadcasting device announces the identified result to the user, with voice broadcast provided as the conventional mode of operation.
      Compared with the prior art, the results obtained by this method better match the target the user intends to look at, and target loss caused by deformation or high-speed movement of the target is effectively reduced; even if the target is temporarily lost, it can be reacquired by using the user's gaze point. Meanwhile, the method does not need to repeatedly upload the same target for identification, which reduces the communication volume with the server and the load on the server.
    Drawings
      Fig. 1 is a flow chart of the eye tracking dynamic target identification method based on a three-dimensional gaze point.
      Fig. 2 is a schematic structural diagram of the eye tracking dynamic target identification device based on a three-dimensional gaze point.
      Fig. 3 is a flow chart of the single-target tracking device.
    Detailed Description
      The invention is further described below with reference to the drawings and specific examples. Note that the technical solution and design principle of the invention are described in detail below using only one optimized technical solution, but the scope of protection of the invention is not limited thereto.
      The examples are preferred embodiments of the invention, but the invention is not limited to these embodiments; any obvious modification, substitution or variation that a person skilled in the art can make without departing from the substance of the invention falls within the scope of protection of the invention.
      The invention provides an eye tracking dynamic identification method based on a three-dimensional gaze point which, as shown in Fig. 1, comprises the following steps:
       S1, acquiring three-dimensional gaze point data of a user with an eye tracking device, constructing a gaze point-time sequence, and screening out the user's high-density three-dimensional gaze points;
      As a preferred embodiment of the invention, step S1 comprises:
       S1.1, organizing the three-dimensional gaze point data into a time series by timestamp;
       S1.2, screening out the high-density gaze points within a period using a density clustering algorithm and, combined with the time information, establishing a set of gaze point-time sequences (x, y, z, t), where x, y, z denote a screened high-density three-dimensional gaze point of the user and t denotes its timestamp.
      As a preferred embodiment of the invention, step S1.2 comprises:
       S1.2.1, defining a spatio-temporal distance calculation formula,
       where α denotes the time weight coefficient;
       S1.2.2, processing the time series with the DBSCAN algorithm: the values of ε and MinPts are determined from how the three-dimensional gaze points are distributed over time, and the number of points satisfying the condition within the neighborhood of each data point is compared with MinPts, where ε is the neighborhood radius defining the neighborhood around a point and MinPts is the minimum number of neighboring points; if the number of points around a point is greater than or equal to MinPts, the point is regarded as a core point;
       S1.2.3, if a point is confirmed to be a core point and other points lie within its neighborhood, those points belong to the same cluster; clusters whose size exceeds a set threshold are selected, and the points in these clusters are the screened high-density three-dimensional gaze points of the user.
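      As a non-limiting illustration, the screening of steps S1.2.1 to S1.2.3 could be sketched as below. The spatio-temporal distance formula is not reproduced in this text, so the form used in `st_distance` (Euclidean distance with the squared time difference weighted by α) is an assumption, as are all function names and parameter values.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float, float]  # (x, y, z, t)


def st_distance(p: Point, q: Point, alpha: float) -> float:
    # ASSUMED form of the spatio-temporal distance; alpha weights the time term.
    dx, dy, dz, dt = (p[i] - q[i] for i in range(4))
    return math.sqrt(dx * dx + dy * dy + dz * dz + alpha * dt * dt)


def dbscan(points: List[Point], eps: float, min_pts: int, alpha: float) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i: int) -> List[int]:
        # The neighborhood includes the point itself.
        return [j for j, q in enumerate(points)
                if st_distance(points[i], q, alpha) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster              # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # former noise becomes a border point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:       # j is itself a core point: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels


def high_density_gaze_points(points: List[Point], eps: float, min_pts: int,
                             alpha: float, min_cluster_size: int) -> List[Point]:
    """Keep only points in clusters larger than the size threshold (S1.2.3)."""
    labels = dbscan(points, eps, min_pts, alpha)
    sizes = {}
    for lab in labels:
        if lab >= 0:
            sizes[lab] = sizes.get(lab, 0) + 1
    return [p for p, lab in zip(points, labels)
            if lab >= 0 and sizes[lab] > min_cluster_size]
```

      The neighbor search here is quadratic in the number of points, which is adequate for short gaze windows; a spatial index would be needed for longer sequences.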
       S2, establishing a spatial coordinate system based on the eye tracking device, in which the center point of the binocular camera of the user's eye tracking device is the origin, the user's horizontal direction is the X axis with the user's right as the positive X direction, the vertical direction is the Y axis with upward as the positive Y direction, and the front-back direction is the Z axis with forward as the positive Z direction;
       S3, denoting the position of the center point of the user's eyes as E and the three-dimensional gaze point data as $(x_{\text{gaze}}, y_{\text{gaze}}, z_{\text{gaze}})$, substituting both into the spatial coordinate system based on the eye tracking device, and establishing a line-of-sight equation to obtain the user's gaze line-of-sight function $L(x_{\text{sight}}, y_{\text{sight}}, z_{\text{sight}})$, whose starting point is the position E and whose end point is the three-dimensional gaze point $(x_{\text{gaze}}, y_{\text{gaze}}, z_{\text{gaze}})$;
       S4, mapping the gaze line-of-sight function $L(x_{\text{sight}}, y_{\text{sight}}, z_{\text{sight}})$ into the foreground camera coordinate system to obtain the user's two-dimensional gaze point coordinates (x', y').
      As a preferred embodiment of the invention, step S4 comprises:
       S4.1, the foreground camera coordinate system takes the geometric center of the lens of the external camera mounted on the eye tracker as its origin, and the directions of its X, Y and Z axes are consistent with the spatial coordinate system of the eye tracking device;
       S4.2, using the user's gaze line-of-sight function $L(x_{\text{sight}}, y_{\text{sight}}, z_{\text{sight}})$ and the intrinsic parameters of the foreground camera, converting the gaze line-of-sight function in the eye tracking device coordinate system into the foreground line-of-sight function $K(x_{\text{fg}}, y_{\text{fg}}, z_{\text{fg}})$ in the foreground camera coordinate system by means of a view transformation matrix, the conversion formula being $K(x, y, z) = R \cdot L(x, y, z) + T$,
      where R is a 3×3 rotation matrix representing the rotation of the eye tracking device coordinate system relative to the foreground camera coordinate system, and T is a 3×1 translation vector representing the position of the origin of the eye tracking device in the foreground camera coordinate system;
       S4.3, assuming that, in the foreground camera coordinate system, the foreground image plane is parallel to the X and Y axes and has Z-axis coordinate g, the X-axis and Y-axis coordinates of the foreground line-of-sight function $K(x_{\text{fg}}, y_{\text{fg}}, z_{\text{fg}})$ at $z_{\text{fg}} = g$ are taken as the two-dimensional gaze point coordinates (x', y') of the user's three-dimensional gaze point on the foreground image.
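      As a non-limiting illustration, steps S3 and S4.1 to S4.3 could be sketched as follows, treating the sight line as the straight line from E to the three-dimensional gaze point, applying the rigid transform K = R · L + T to it, and reading off x and y where z equals g. The function names are hypothetical, and the sketch stops at coordinates on the image plane: a conversion to pixel units with the camera intrinsics is not described in enough detail here to be reproduced.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def transform_point(R: Sequence[Sequence[float]], T: Sequence[float],
                    p: Sequence[float]) -> Vec3:
    """Apply K = R * p + T to a single 3D point."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + T[r] for r in range(3))


def gaze_to_image_plane(E: Sequence[float], G: Sequence[float],
                        R: Sequence[Sequence[float]], T: Sequence[float],
                        g: float) -> Tuple[float, float]:
    """Map the sight line from E to G into the camera frame and cut it at z = g."""
    # A rigid transform maps lines to lines, so transforming both endpoints suffices.
    e_cam = transform_point(R, T, E)
    g_cam = transform_point(R, T, G)
    dz = g_cam[2] - e_cam[2]
    if abs(dz) < 1e-12:
        raise ValueError("sight line is parallel to the image plane")
    s = (g - e_cam[2]) / dz  # parameter along e_cam + s * (g_cam - e_cam)
    x = e_cam[0] + s * (g_cam[0] - e_cam[0])
    y = e_cam[1] + s * (g_cam[1] - e_cam[1])
    return x, y
```

      With R equal to the identity, as implied by the axis alignment in S4.1, the transform reduces to a translation by T.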
       S5, selecting a rectangular area of W × H pixels centered on the two-dimensional gaze point coordinates (x', y'), and establishing the initial coordinates of the identification area (x', y', W/2, H/2), where x', y' are the two-dimensional gaze point coordinates and W/2, H/2 are the distances from the border of the rectangular area to its center, and taking the image of this W × H rectangular area as the template picture for subsequent tracking;
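      As a non-limiting illustration, the selection of the identification area in step S5 could be sketched as follows, with the image represented as a list of pixel rows. The function names are hypothetical, and clamping the rectangle to the image border is an assumption that the text does not address.

```python
from typing import List, Sequence, Tuple

Region = Tuple[float, float, float, float]  # (x', y', W/2, H/2)


def identification_region(x: float, y: float, W: int, H: int) -> Region:
    """Build the initial coordinates (x', y', W/2, H/2) of the identification area."""
    return (x, y, W / 2, H / 2)


def crop_template(image: Sequence[Sequence], region: Region) -> List[list]:
    """Cut the W x H rectangle centered on the gaze point out of the image."""
    cx, cy, hw, hh = region
    rows, cols = len(image), len(image[0])
    # Assumption: a rectangle that crosses the image border is clamped to it.
    top = max(0, int(round(cy - hh)))
    bottom = min(rows, int(round(cy + hh)))
    left = max(0, int(round(cx - hw)))
    right = min(cols, int(round(cx + hw)))
    return [list(row[left:right]) for row in image[top:bottom]]
```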
       S6, using the PyTorch framework to load a weight file trained in advance under the YOLOv8 framework, initializing a template picture recognition model, and extracting deep features of the template picture with this model, where the deep features comprise low-level features, mid-level features, high-level features and global features, and the global features aggregate the preceding three to generate semantically rich context information for subsequent target classification and definition.
      As a preferred embodiment of the invention, the PyTorch framework is used to load a weight file trained in advance under the YOLOv8 framework, and the template picture recognition model is initialized. The weight file mainly contains the following four kinds of content:
       S6.1, model architecture information, including the specific layers of the network and the parameters of each layer, such as the number of channels, the stride and the convolution kernel size.
      S6.2, loss-function-related parameters, including the bounding-box regression loss, used for predicting the coordinates and size of the bounding box; the target confidence loss, used for judging whether an object is present at a given position; and the classification loss, responsible for the category classification task in detection.
      S6.3, detection head parameters, used for bounding-box prediction, including the weights of the classification head, the confidence head and the regression head.
      S6.4, training state information, including training hyperparameters such as the learning rate, batch size and weight decay.
      As a preferred embodiment of the invention, the template picture is input into the template picture recognition model for preprocessing: the image is enlarged with its aspect ratio preserved and resized to 640 × 640, with the shorter side padded with black to make up the size. The enlarged picture is then normalized, and its dimensions are permuted to the channel-first order (3, 640, 640) to meet the input requirements of the model. The preprocessed image is input into the template picture recognition model, and deep features are extracted by its backbone network, comprising low-level features, mid-level features, high-level features and global features. The low-level features have higher resolution and retain rich edge and texture information. The mid-level features capture the contour, shape and local structure of the object. The high-level features are used to identify the category of the object and background information. The global features aggregate the preceding three kinds of features to generate semantically rich context information for subsequent target classification and definition.
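      As a non-limiting illustration, the preprocessing described above could be sketched with NumPy as follows. The function name `letterbox`, the nearest-neighbour resize (a stand-in for a proper image resizer), the centering of the picture on the black canvas and the division by 255 as the normalization are all assumptions.

```python
import numpy as np


def letterbox(image: np.ndarray, size: int = 640) -> np.ndarray:
    """Turn an H x W x 3 uint8 image into a float32 array of shape (3, size, size)."""
    h, w = image.shape[:2]
    scale = size / max(h, w)  # same factor on both axes keeps the aspect ratio
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))

    # Nearest-neighbour resize by index lookup (assumption: stand-in resizer).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]

    # Black canvas; the shorter side is padded (assumption: padding is centered).
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized

    # Normalize to [0, 1] and move channels first: (H, W, 3) -> (3, H, W).
    return np.transpose(canvas.astype(np.float32) / 255.0, (2, 0, 1))
```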
       S7, feeding the extracted deep features of the template picture into the head network of the template picture recognition model for the classification task, the head network computing the probability distribution over the categories and taking the result with the highest confidence as the recognition result of the template picture.
      As a preferred embodiment of the invention, step S7 comprises:
       S7.1, passing the extracted deep features to the head network of the template picture recognition model, and using PANet to further aggregate and optimize the multi-level features so that the model has both global and local features;
       S7.2, outputting a confidence score for each category from the aggregated multi-level features according to the YOLOv8 classification loss function;
       S7.3, selecting, from the confidence scores of all categories, the category with the highest confidence as the final template picture recognition result.
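      As a non-limiting illustration, the final selection of steps S7.2 and S7.3 could be sketched as follows. The raw scores are taken as given, since the head network itself is a trained model; using a softmax to obtain the probability distribution is an assumption, and the function and class names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw class scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recognize(logits: Sequence[float], class_names: Sequence[str]) -> Tuple[str, float]:
    """Return the category with the highest confidence and that confidence."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]
```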
       S8, sending the recognition result of the template picture to the eye tracking device, which records the received result and compares it with the previous result; if the result has changed, the user is informed of the recognition result, and if it has not changed, no operation is performed;
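      As a non-limiting illustration, the comparison of step S8 could be sketched as follows. The class name `ResultAnnouncer` and the `announce` callback, which stands in for the result broadcasting device, are hypothetical.

```python
from typing import Callable, Optional


class ResultAnnouncer:
    """Record each received result and announce it only when it changes."""

    def __init__(self, announce: Callable[[str], None]) -> None:
        self._announce = announce          # hypothetical stand-in for the broadcaster
        self._last: Optional[str] = None   # previous recognition result

    def receive(self, result: str) -> bool:
        """Return True if the user was informed, False if nothing was done."""
        if result == self._last:
            return False
        self._last = result
        self._announce(result)
        return True
```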
       S9, initializing a single-target tracker based on the Siamese framework, extracting each foreground video frame from the 30 FPS foreground video stream transmitted by the eye tracking device, and numbering the frames in ascending chronological order; inputting the initial coordinates (x', y', W/2, H/2) of the identification area and the template picture into the template branch of the single-target tracker, inputting the lowest-numbered foreground video frame into the search branch of the single-target tracker, and then deleting that frame;
       S10, extracting features from the template branch and the search branch of the single-target tracker with an HRNet network, and performing a depthwise cross-correlation between the resulting template-branch feature map and search-branch feature map to obtain a similarity response map.
      As a preferred embodiment of the invention, step S10 comprises:
       S10.1, the template branch and the search branch input to the system use a weight-sharing feature extraction network, so that the two branches are matched in the same feature space; backbone networks of the high-resolution feature extraction network HRNet with identical structure extract features from the template-branch image and the search-branch image respectively, so as to retain more detail;
       S10.2, after feature extraction, performing a depthwise cross-correlation on the resulting template-branch feature map and search-branch feature map, that is, convolving the template-branch feature map with the search-branch feature map channel by channel along the channel dimension, to obtain a two-dimensional similarity response map in which the value at each position represents the similarity between the template and the corresponding position of the search area.
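      As a non-limiting illustration, the channel-by-channel cross-correlation of step S10.2 could be sketched with NumPy as follows. The feature maps are taken as given arrays, since the HRNet backbone is a trained network; collapsing the per-channel responses into a single two-dimensional map by summation is an assumption, and the function names are hypothetical.

```python
import numpy as np


def depthwise_xcorr(template: np.ndarray, search: np.ndarray) -> np.ndarray:
    """Correlate template (C, h, w) with search (C, H, W), one channel at a time.

    Returns the per-channel responses with shape (C, H - h + 1, W - w + 1).
    """
    C, h, w = template.shape
    _, H, W = search.shape
    out = np.zeros((C, H - h + 1, W - w + 1), dtype=np.float64)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search[:, i:i + h, j:j + w]
            # Each channel of the window is multiplied only with the same
            # channel of the template, then summed over the spatial axes.
            out[:, i, j] = np.sum(window * template, axis=(1, 2))
    return out


def similarity_map(template: np.ndarray, search: np.ndarray) -> np.ndarray:
    """Collapse the channels into one 2D similarity map (assumption: by summing)."""
    return depthwise_xcorr(template, search).sum(axis=0)
```

      The explicit loops keep the sketch readable; a practical tracker would typically run the same operation as a grouped convolution on the GPU.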
      S11, generating a plurality of anchors in the high-confidence regions of the similarity response map, and performing a regression task and a classification task on each anchor to generate the tracking result;
       As a preferred embodiment of the present invention, step S11 includes the following: 
       S11.1, based on the obtained similarity response map, a threshold is set as a percentage of the maximum value, and regions whose values exceed 80% of the maximum are marked as high-confidence regions; a plurality of anchors is generated in the high-confidence regions of the similarity response map, each anchor being centered on a high-confidence region and given several different scales and aspect ratios so as to adapt to the shape and size of the target; the anchors serve as reference boxes for target detection and provide the basis for the subsequent regression and classification tasks;
       S11.2, a regression task and a classification task are performed on each anchor: the regression task, according to the regression loss function, selects among the different scales and aspect ratios the one with the highest regression score as the size and aspect ratio of the anchor box, so that the anchor box fits the target more closely; the classification branch, according to the classification loss function, generates a confidence score for each anchor representing the probability that the anchor contains the target; the anchor with the highest classification score is then found and its candidate box is taken as the tracking result.
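Steps S11.1 and S11.2 can be sketched as follows. The sketch assumes a response map with a positive maximum, a hypothetical stride that maps response cells to image pixels, and illustrative scale and aspect-ratio values; the classification scores would come from the trained classification branch and are passed in here as plain numbers. The regression refinement itself is not shown.

```python
import numpy as np


def high_confidence_mask(response: np.ndarray, ratio: float = 0.8) -> np.ndarray:
    """Mark cells whose value exceeds `ratio` times the maximum of the map."""
    return response > ratio * response.max()


def generate_anchors(mask, scales=(32.0, 64.0), ratios=(0.5, 1.0, 2.0), stride=8):
    """Build anchors (cx, cy, w, h) centered on every high-confidence cell.

    `scales`, `ratios` (width / height) and `stride` are illustrative assumptions.
    """
    anchors = []
    for row, col in np.argwhere(mask):
        cx, cy = col * stride, row * stride
        for scale in scales:
            for ratio in ratios:
                w = scale * np.sqrt(ratio)
                h = scale / np.sqrt(ratio)
                anchors.append((cx, cy, w, h))
    return np.array(anchors, dtype=float)


def pick_tracking_result(boxes: np.ndarray, cls_scores: np.ndarray) -> np.ndarray:
    """Return the candidate box of the anchor with the highest classification score."""
    return boxes[int(np.argmax(cls_scores))]


# Toy example: two cells exceed 80% of the maximum (0.9 and 1.0).
response = np.array([[0.1, 0.2],
                     [0.9, 1.0]])
mask = high_confidence_mask(response)
anchors = generate_anchors(mask, scales=(32.0,), ratios=(1.0,), stride=8)
scores = np.array([0.2, 0.7])  # hypothetical classification-branch outputs
result = pick_tracking_result(anchors, scores)
```

Thresholding relative to the maximum, rather than using a fixed value, keeps the number of candidate anchors stable when the overall response strength changes between frames.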
      S12, marking the tracking result in the foreground video frame, which completes the tracking of the target in one frame of the foreground picture; S9 to S12 are repeated over the foreground video stream so that the target is tracked throughout the foreground video;
       S13, while the target is being tracked in the foreground video, monitoring whether the two-dimensional gaze point coordinate (x', y') of the user stays away from the tracking candidate box for a long time; if (x', y') has not entered the candidate box area for longer than a set time threshold, the user is judged to have lost attention to the target, and the method returns to step S5 to reselect the initial coordinates of the identification area and the template picture; if (x', y') has not stayed away for longer than the time threshold, the previous result is maintained.
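The deviation check of S13 can be sketched as a small state machine. The sketch assumes that the candidate box is given as (cx, cy, w, h) with a center point, that timestamps are in seconds, and that the threshold value is chosen by the implementer; `GazeDeviationMonitor` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (cx, cy, w, h)


def gaze_in_box(gaze: Tuple[float, float], box: Box) -> bool:
    """True if the 2-D gaze point lies inside the candidate box."""
    gx, gy = gaze
    cx, cy, w, h = box
    return abs(gx - cx) <= w / 2 and abs(gy - cy) <= h / 2


@dataclass
class GazeDeviationMonitor:
    time_threshold_s: float
    _outside_since: Optional[float] = None

    def update(self, gaze: Tuple[float, float], box: Box, now_s: float) -> bool:
        """Return True once the gaze has stayed outside the box longer than the threshold.

        True corresponds to "attention lost, go back to S5"; False keeps the last result.
        """
        if gaze_in_box(gaze, box):
            self._outside_since = None  # gaze returned, reset the timer
            return False
        if self._outside_since is None:
            self._outside_since = now_s  # first sample outside the box
        return (now_s - self._outside_since) > self.time_threshold_s


monitor = GazeDeviationMonitor(time_threshold_s=1.0)
box = (5.0, 5.0, 4.0, 4.0)
history = [
    monitor.update((5.0, 5.0), box, 0.0),    # inside the box
    monitor.update((50.0, 50.0), box, 0.5),  # just left the box
    monitor.update((50.0, 50.0), box, 1.0),  # outside for 0.5 s
    monitor.update((50.0, 50.0), box, 1.6),  # outside for 1.1 s
    monitor.update((5.0, 5.0), box, 1.7),    # back inside
]
```

Resetting the timer whenever the gaze re-enters the box means that brief glances away from the target do not trigger a new recognition.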
      The invention further provides an eye movement tracking dynamic identification device based on the three-dimensional gaze point, comprising an eye state detection device, a two-dimensional gaze point mapping device, a target identification device, a single-target tracking device, a gaze deviation monitoring device and an identification result broadcasting device. The eye state detection device screens out the three-dimensional gaze points that are in a gazing state from the three-dimensional gaze points provided by the eye movement equipment, according to the gazing time and the gaze point distribution. The two-dimensional gaze point mapping device converts the screened three-dimensional gaze points into the corresponding two-dimensional coordinates and maps them onto the corresponding foreground image. The target identification device identifies the image of the area around the user's gaze point, recognizes the target through a local identification library or an online identification interface, and extracts the target as a template picture. The single-target tracking device tracks the target. The gaze deviation monitoring device combines the information of the tracked target with the user's gaze point information and judges whether the gaze point has deviated from the target area for a long time. The identification result broadcasting device announces the identification result to the user, typically by voice broadcast.
      As a preferred embodiment of the present invention, as shown in FIG. 3, the single-target tracking device of the present invention is implemented as follows:
       1) The improved Siamese model is loaded, and the initial coordinates (x', y', W/2, H/2) of the identification area and the template image are input into the template branch of the single-target tracker. The foreground video frame with the smallest number is input into the search branch of the single-target tracker and is then deleted.
      2) The system integrates an HRNet network, which extracts features from the template image and the search image respectively so as to retain more high-resolution detail information.
      3) A depth-wise cross-correlation operation is performed on the template-branch feature map and the search-branch feature map to generate a two-dimensional similarity response map, in which the value at each position represents the similarity between the template and the corresponding position of the search area.
      4) Based on the similarity response map, high-confidence regions are screened out according to a set threshold, and a plurality of anchors is generated in the high-confidence regions for subsequent target localization.
      5) A regression task and a classification task are performed on each anchor: the regression task determines the size and aspect ratio of the anchor box, and the classification task finds the anchor with the highest score and takes its candidate box as the tracking result.
      6) The final result is output, the foreground video frame that now has the smallest number is input into the search branch of the single-target tracker, and steps 1) to 6) are repeated.
    Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202510500113.6A CN120526469A (en) | 2025-04-21 | 2025-04-21 | Eye movement tracking dynamic target identification method and device based on three-dimensional fixation point | 
Publications (1)
| Publication Number | Publication Date | 
|---|---|
| CN120526469A (en) | 2025-08-22 |
Family
ID=96746201
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510500113.6A Pending CN120526469A (en) | 2025-04-21 | 2025-04-21 | Eye movement tracking dynamic target identification method and device based on three-dimensional fixation point | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN120526469A (en) | 
- 2025-04-21: CN application CN202510500113.6A, publication CN120526469A, active, pending
 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |