
CN119183554A - Context-based gesture recognition - Google Patents


Info

Publication number
CN119183554A
Authority
CN
China
Prior art keywords
gesture
image
confidence score
confidence
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280095839.0A
Other languages
Chinese (zh)
Inventor
刘杰
周扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innopeak Technology Inc
Original Assignee
Innopeak Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology Inc filed Critical Innopeak Technology Inc
Publication of CN119183554A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application is directed to recognizing gestures in an image. An electronic device obtains an image including a hand region and detects the hand region in the image. A first gesture is determined from the hand region of the image, and a second gesture is determined from the image. Based on determining that the first gesture is not any of a plurality of contextual gestures, the electronic device determines that a final gesture of the image is the first gesture. Conversely, based on determining that the first gesture is one of a plurality of contextual gestures, the electronic device determines a final gesture from the image based on a second gesture associated with the image and a corresponding confidence score.

Description

Context-based gesture recognition
Technical Field
The present application relates generally to gesture recognition techniques, including but not limited to methods, systems, and non-transitory computer-readable media for recognizing gestures from image data.
Background
Existing gesture recognition solutions typically require capturing gestures (e.g., hand gestures, facial movements, etc.) close to a camera (e.g., less than 1 meter away) to accurately detect the gestures. As the distance between the gesture and the imaging device increases, the gesture becomes smaller and blends with the many other objects captured simultaneously with the gesture. To overcome these challenges, electronic devices applied to gesture recognition are adjusted to focus on a small area containing the gesture, for example, by zooming in and limiting the field of view to a small area. However, some electronic devices cannot focus on a particular gesture before capturing an image and must crop the image to obtain a small area containing the gesture. Other solutions rely on powerful deep learning models or a fusion of detection and classification processes, and thus require significant computational resources for gesture detection and recognition. It would therefore be beneficial to have a system and method that can accurately and efficiently detect gestures captured in an image, including gestures that may be far from the camera and mixed with the image background.
Disclosure of Invention
Various embodiments of the present application are directed to gesture recognition techniques that fuse local gesture information with contextual gesture information to improve the accuracy and efficiency of gesture recognition. The local gesture information includes information about the hand or body part performing the gesture, while the contextual gesture information includes information about the gesture's surroundings, such as the environment (e.g., an office), the location of the gesture relative to the user, and/or other factors. Furthermore, in some embodiments, an initial gesture classification is applied to simplify the detection and classification process, thereby improving overall efficiency. In some embodiments, one or more gestures are identified based on the contextual information (e.g., moving a hand to the mouth to represent silence). Such contextual information in the image serves to improve the accuracy of gesture recognition and reduce the number of false positives. In an example, the context information is used to distinguish between a local gesture, a contextual gesture, and/or a non-gesture.
In one aspect, a gesture classification method is provided. The method includes obtaining an image including a hand region, detecting the hand region in the image, determining a first gesture from the hand region of the image, and determining a second gesture from the image (e.g., the entire image). The method further includes, in accordance with determining that the first gesture is not any of a plurality of contextual gestures, determining that the final gesture of the image is the first gesture, and, in accordance with determining that the first gesture is one of the plurality of contextual gestures, determining the final gesture based on a second gesture and a second confidence score, the second gesture and the second confidence score being associated with the image (e.g., the entire image).
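For illustration only, the following sketch shows how these steps could be composed; the detection, classification, and contextual-resolution functions are passed in as hypothetical callables and are not the claimed implementation.

```python
# A minimal sketch of the claimed flow, assuming hypothetical callables for the
# detection model, the two classifiers, and the contextual resolution step.
def recognize_gesture(image, detect_hand, classify_crop, classify_image,
                      resolve_contextual, contextual_gestures):
    hand_box, box_conf = detect_hand(image)                     # detect the hand region
    first_gesture, first_conf = classify_crop(image, hand_box)  # gesture from the hand region
    second_gesture, second_conf = classify_image(image)         # gesture from the entire image

    if first_gesture not in contextual_gestures:
        return first_gesture                                    # final gesture is the first gesture
    # Otherwise decide from the whole-image (contextual) gesture and its confidence score.
    return resolve_contextual(first_gesture, second_gesture, second_conf, box_conf)
```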
In some embodiments, determining the first gesture from the hand region of the image further comprises generating a first gesture vector from the hand region of the image. Each element of the first gesture vector corresponds to a respective gesture and represents a respective first confidence level for the hand region including the respective gesture. The method also includes determining a first gesture and a first gesture confidence score from the first gesture vector. In some embodiments, the method further includes associating a detection of the hand region in the image with a bounding box confidence score, and combining the bounding box confidence score with the confidence score associated with the first gesture to generate the first gesture confidence score. In some embodiments, the first gesture comprises a respective gesture corresponding to a maximum first confidence level of the respective first confidence levels of each element of the first gesture vector, and the first gesture confidence score is equal to the maximum first confidence level of the respective first confidence levels of each element of the first gesture vector.
In some embodiments, determining the second gesture from the image further includes generating a second gesture vector from the image (e.g., the entire image). Each element of the second gesture vector corresponds to a respective gesture and represents a respective second confidence level of the image comprising the respective predefined gesture. The method also includes determining a second gesture and a second gesture confidence score from the second gesture vector. In some embodiments, the second gesture includes a respective gesture corresponding to a largest second confidence level of the respective second confidence levels of each element of the second gesture vector, and the second gesture confidence score is equal to the largest second confidence level of the respective second confidence levels of each element of the second gesture vector.
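As a concrete illustration of reading a gesture and its confidence score off a normalized gesture vector, and of one possible way to fold in the bounding box confidence score (multiplication is an assumption here, not necessarily the combination used by the patent), consider the sketch below; the same helper applies to both the first and the second gesture vector.

```python
# A minimal sketch: select the gesture with the largest confidence from a
# normalized gesture vector; combining scores by multiplication is illustrative.
import numpy as np

def argmax_gesture(gesture_vector, gesture_names):
    idx = int(np.argmax(gesture_vector))              # element with the maximum confidence
    return gesture_names[idx], float(gesture_vector[idx])

def combine_with_box_confidence(cls_conf, bbox_conf):
    # One plausible way to merge the classification and bounding box confidences.
    return cls_conf * bbox_conf
```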
In some embodiments, the method includes, prior to determining whether the first gesture is at least one of the plurality of contextual gestures, determining whether the first gesture confidence score is greater than a second threshold P2, and in accordance with determining that the first gesture confidence score is less than the second threshold P2, determining that the image is not associated with any gesture. In some embodiments, the method further includes, prior to determining whether the first gesture is at least one of the plurality of contextual gestures, determining whether a second gesture confidence score of the second gesture is greater than a first threshold P1, and in accordance with determining that the second gesture confidence score of the second gesture is less than the first threshold P1, determining that the image is not associated with any gesture.
In some embodiments, determining the final gesture based on the second gesture and the second confidence score further includes determining that the image is not associated with any gesture in accordance with determining that the first gesture and the second gesture are different from each other. The method further includes, in accordance with a determination that the first gesture and the second gesture are identical to each other, (1) in accordance with a determination that the third confidence score exceeds the integrated confidence threshold, determining that the final gesture is the second gesture, and (2) in accordance with a determination that the third confidence score does not exceed the integrated confidence threshold, determining that the image is not associated with any gesture.
In some embodiments, the method further includes filtering the final gesture using a filter function for identifying false positives with the aid of time information (i.e., results from previous images). In some embodiments, the filter function is one of a convolution function, a Fourier filter function, or a Kalman filter. In some embodiments, the filter function is a function of time.
In another aspect, some embodiments include an electronic device comprising one or more processors and a memory having instructions stored thereon, which when executed by the one or more processors, cause the processors to perform any of the methods described above.
In another aspect, some embodiments include a non-transitory computer-readable medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform any of the methods described above.
These exemplary embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding of the disclosure. Further examples are discussed in the detailed description and further description is provided.
Drawings
For a better understanding of the various embodiments described, reference should be made to the following detailed description taken in conjunction with the accompanying drawings in which like reference numerals refer to corresponding parts throughout the drawings.
FIG. 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, according to some embodiments.
Fig. 2 is a block diagram illustrating an electronic device for processing content data (e.g., image data) in accordance with some embodiments.
FIG. 3 is a flow diagram of a gesture detection and classification process using image data, according to some embodiments.
FIG. 4 is a flowchart of an example post-processing technique to determine a final gesture from two gestures determined from image data, according to some embodiments.
FIG. 5 is a flow chart of a method of classifying one or more gestures, according to some embodiments.
Like reference numerals designate corresponding parts throughout the several views of the drawings.
Detailed Description
Reference will now be made in detail to the specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to provide an understanding of the subject matter presented herein. It will be apparent, however, to one skilled in the art that various alternatives may be used without departing from the scope of the claims, and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein may be implemented on many types of electronic devices having digital video capabilities.
FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, according to some embodiments. The one or more client devices 104 may be, for example, a desktop computer 104A, a tablet computer 104B, a mobile phone 104C, a head-mounted display (HMD) (also referred to as augmented reality (augmented reality, AR) glasses) 104D, or a smart multi-sensor networking home device (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 may collect data or user input, execute user applications, and present output on its user interface. The collected data or user input may be processed locally at the client device 104 and/or remotely by the server 102. One or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to client devices 104, and in some embodiments, process data and user inputs received from client devices 104 when the user applications are executed on client devices 104. In some embodiments, the data processing environment 100 also includes a memory 106, the memory 106 for storing data related to the server 102, the client device 104, and applications executing on the client device 104.
The one or more servers 102 are operable to enable real-time data communication with client devices 104 that are remote from each other or client devices 104 that are remote from the one or more servers 102. Further, in some embodiments, one or more servers 102 are configured to perform data processing tasks that cannot be accomplished locally by, or would not be prioritized by, the client device 104. For example, the client device 104 includes a game console (e.g., HMD 104D) executing an interactive online game application. The game console receives the user instructions and sends them to the game server 102 along with the user data. The game server 102 generates a video data stream based on the user instructions and user data and provides the video data stream for display on the game console and other client devices that participate in the same game session with the game console. In another example, the client device 104 includes a networked monitoring camera 104E and a mobile phone 104C. The networked monitoring camera 104E collects video data and transmits the video data to the monitoring camera server 102 in real time. Although the video data is optionally pre-processed by the monitoring camera 104E, the monitoring camera server 102 processes the video data to identify motion or audio events in the video data and shares information of those events with the mobile phone 104C, thereby allowing the user of the mobile phone 104C to remotely monitor, in real time, events occurring in the vicinity of the networked monitoring camera 104E.
One or more servers 102, one or more client devices 104, and memory 106 are communicatively coupled to one another via one or more communication networks 108, the communication networks 108 being the media used to provide communication links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include a local area network (local area network, LAN), a wide area network (wide area network, WAN) such as the internet, or a combination thereof. One or more of the communication networks 108 may alternatively be implemented using any known network protocol, including various wired or wireless protocols such as Ethernet, universal serial bus (universal serial bus, USB), FireWire, long term evolution (long term evolution, LTE), global system for mobile communications (global system for mobile communications, GSM), enhanced data GSM environment (enhanced data GSM environment, EDGE), code division multiple access (code division multiple access, CDMA), time division multiple access (time division multiple access, TDMA), Bluetooth, Wi-Fi, voice over Internet protocol (voice over Internet protocol, VoIP), Wi-MAX, or any other suitable communication protocol. Connections to one or more communication networks 108 may be established directly (e.g., using 3G/4G connections to wireless carriers), or through a network interface 110 (e.g., a router, switch, gateway, hub, or intelligent private home control node), or through any combination thereof. As such, the one or more communication networks 108 may represent the internet of a worldwide collection of networks and gateways that use the transmission control protocol/internet protocol (transmission control protocol/Internet protocol, TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages.
In some embodiments, deep learning techniques are applied to the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executing at the client device 104, to identify information contained in the content data, to match the content data with other data, to classify the content data, or to synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensors of the client device 104. In these deep learning techniques, a data processing model is created based on one or more neural networks to process content data. These data processing models are trained with training data before being applied to process content data. After model training, the mobile phone 104C or HMD 104D obtains content data (e.g., captures video data through an internal camera) and processes the content data locally using a data processing model.
In some embodiments, model training and data processing is implemented locally on each individual client device 104 (e.g., mobile phone 104C and HMD 104D). The client device 104 obtains training data from one or more servers 102 or memories 106 and applies the training data to train the data processing model. Alternatively, in some embodiments, model training and data processing is implemented remotely on a server 102 (e.g., server 102A) associated with a client device 104 (e.g., client device 104A and HMD 104D). The server 102A obtains training data from itself, another server 102, or the memory 106 and applies the training data to train the data processing model. The client device 104 obtains content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing model, receives data processing results (e.g., recognized gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the gestures, or implements some other function based on the results. The client device 104 itself performs little or no processing of the content data prior to sending it to the server 102A. Further, in some embodiments, data processing is implemented locally on the client device 104 (e.g., client device 104B and HMD 104D), while model training is implemented remotely on a server 102 (e.g., server 102B) associated with the client device 104. Server 102B obtains training data from itself, another server 102, or memory 106 and applies the training data to train the data processing model. The trained data processing model is optionally stored in server 102B or memory 106. Client device 104 imports a trained data processing model from server 102B or memory 106, processes content data using the data processing model, and generates data processing results for presentation on a user interface or for locally launching functions (e.g., rendering virtual objects based on device poses).
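As a sketch of the remote-processing variant described above, a client might upload an image and receive the recognized gesture back from the server; the endpoint URL, field names, and response format below are hypothetical placeholders, not part of the patent.

```python
# A minimal sketch of client-to-server gesture recognition; the endpoint,
# request fields, and JSON keys are hypothetical.
import requests

def recognize_remotely(image_path: str) -> str:
    with open(image_path, "rb") as f:
        resp = requests.post(
            "https://server.example.com/api/gesture",   # hypothetical endpoint
            files={"image": f},
            timeout=5,
        )
    resp.raise_for_status()
    return resp.json()["gesture"]                       # e.g., "scissors" or "none"
```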
In some embodiments, a pair of AR glasses 104D (also referred to as an HMD) is communicatively coupled in the data processing environment 100. AR glasses 104D include a camera, microphone, speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. A camera and microphone are used to capture video and audio data from the scene in which AR glasses 104D are located, while one or more inertial sensors are used to capture inertial sensor data. In some cases, the camera captures gestures of the user wearing the AR glasses 104D, and the gestures are recognized locally in real time using a gesture recognition model. In some cases, the microphone records ambient sounds, including voice commands of the user. In some cases, both video or still visual data captured by the camera and inertial sensor data measured by one or more inertial sensors are used to determine and predict device pose. Video, still image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, the server 102, or both to recognize device poses. Optionally, server 102 and AR glasses 104D together apply deep learning techniques to recognize and predict device poses. The device poses are used to control the AR glasses 104D itself or to interact with applications (e.g., gaming applications) executed by the AR glasses 104D. In some embodiments, the display of AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with a user-selectable display item (e.g., avatar) on the user interface.
As described above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, still image data, or inertial sensor data captured by the AR glasses 104D. A 2D or 3D device pose is identified and predicted based on such video, still image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using the second data processing model. Training of the first data processing model and the second data processing model is optionally performed by the server 102 or the AR glasses 104D. The inference of device poses and visual content is accomplished by the server 102 and the AR glasses 104D each independently, or by the server 102 and the AR glasses 104D together.
Fig. 2 is a block diagram illustrating an electronic device 200 for processing content data (e.g., image data) in accordance with some embodiments. The electronic device 200 is one of the server 102, the client device 104 (e.g., AR glasses 104D in fig. 1), the memory 106, or a combination thereof. In an example, the electronic device 200 is a mobile device that includes a gesture recognition module 230, the gesture recognition module 230 applying a neural network model (e.g., in fig. 3) end-to-end to locally recognize gestures at the mobile device. The electronic device 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic device 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, mouse, voice command input unit or microphone, touch screen display, touch-sensitive tablet, gesture-capturing camera, or other input buttons or controls. Further, in some embodiments, the electronic device 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace a keyboard. In some embodiments, the electronic device 200 includes one or more optical cameras (e.g., RGB cameras 260), scanners, or light sensor units for capturing images of, for example, a graphical serial code printed on the electronic device. In some embodiments, the electronic device 200 also includes one or more output devices 212 for presenting user interfaces and displaying content, including one or more speakers and/or one or more visual displays. Optionally, the electronic device 200 comprises a location detection device, such as a GPS (global positioning system) or other geographical location receiver, for determining the location of the electronic device 200. Optionally, the electronic device 200 comprises an inertial measurement unit (inertial measurement unit, IMU) 280, the IMU 280 integrating sensor data captured by the multi-axis inertial sensor to estimate the position and orientation of the electronic device 200 in space. Examples of one or more inertial sensors of IMU 280 include, but are not limited to, gyroscopes, accelerometers, magnetometers, and inclinometers.
Alternatively or in addition, in some embodiments, the electronic device 200 is communicatively coupled to one or more devices (e.g., server 102, client device 104, memory 106, or a combination thereof) including one or more input devices 210, output devices 212, IMU 280, or other components described above, via one or more network interfaces 204, and provides data to the electronic device 200.
Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices, and optionally includes non-volatile memory, such as one or more magnetic disk memory, one or more optical disk memory, one or more flash memory devices, or one or more other non-volatile solid state memory. Optionally, the memory 206 includes one or more memories remote from the one or more processing units 202. The memory 206 or alternatively the non-volatile memory within the memory 206 includes a non-transitory computer-readable storage medium. In some embodiments, memory 206 or a non-transitory computer readable storage medium of memory 206 stores the following programs, modules, and data structures, or a subset or superset thereof:
An operating system 214 including processes for handling various basic system services and for performing hardware-related tasks;
A network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or memory 106) via one or more (wired or wireless) network interfaces 204 and one or more communication networks 108, such as the internet, other wide area networks, local area networks, metropolitan area networks, etc.;
A user interface module 218 for enabling presentation of information (e.g., graphical user interfaces of applications 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., display, speaker, etc.);
an input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected inputs or interactions;
a web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and their web pages, including a network interface for logging into a user account associated with the client device 104 or another electronic device, for controlling the client or electronic device when associated with the user account, and for editing and viewing settings and data associated with the user account;
One or more user applications 224 executed by the electronic device 200 (e.g., games, social networking applications, smart home applications, and/or other web-based or non-web-based applications for controlling another electronic device and viewing data captured by such device);
model training module 226 for receiving training data and building a data processing model for processing content data (e.g., video, image, audio, or text data) to be collected or obtained by client device 104;
A data processing module 228 for processing the content data using the data processing model 250 to identify information contained in the content data, match the content data with other data, classify the content data, or synthesize related content data, wherein in some embodiments the data processing module 228 is associated with one of the user applications 224 to process the content data in response to user instructions received from the user application 224, in an example the data processing module 228 is applied to implement the gesture detection and classification process 300 in fig. 3;
A gesture recognition module 230 for classifying one or more gestures in the image (as shown and described below with reference to fig. 3 and 4), wherein the gesture recognition module 230 further comprises a detection module 232 for detecting one or more objects in the image and/or a classification module 234 for classifying one or more gestures in a region or portion of the image and/or in the entire image, the image data being processed jointly by the detection process 310 and the classification process 320 of the gesture recognition module 230 and the data processing module 228;
one or more databases 240 for storing at least data including one or more of:
device settings 242, including common device settings (e.g., service layer, device model, storage capacity, processing power, communication power, etc.) for one or more of server 102 or client device 104;
User account information 244 for one or more user applications 224, such as user name, security questions, account history data, user preferences, and predefined account settings;
Network parameters 246 of one or more communication networks 108, such as IP address, subnet mask, default gateway, DNS server, and hostname;
training data 248 for training one or more data processing models 250;
a data processing model 250 for processing content data (e.g., video, image, audio, or text data) using a deep learning technique, wherein the data processing model 250 includes an image compression model for implementing an image compression process, a feature extraction model for implementing a multi-scale feature extraction process, and/or one or more classification models and networks as described below with reference to fig. 3 and 4;
A gesture database 252 for storing one or more gestures associated with an image (e.g., stored in a database in the memory 206); and
Content data and results 254 that are obtained and output, respectively, by the electronic device 200 (or by a device communicatively coupled to the electronic device 200, e.g., the client device 104), where the content data is processed locally at the client device 104 or remotely at the server 102 by the data processing model 250 to provide relevant results to be presented on the client device 104.
Optionally, one or more databases 240 are stored at one of the server 102, the client device 104, and the memory 106 of the electronic device 200. Optionally, one or more databases 240 are distributed among more than one of the server 102, client device 104, and memory 106 of the electronic device 200. In some embodiments, more than one copy of the above data is stored at different devices, e.g., two copies of the data processing model 250 are stored at the server 102 and the memory 106, respectively.
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices and correspond to a set of instructions for performing the above described functions. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments. In some embodiments, memory 206 optionally stores a subset of the modules and data structures identified above. Further, memory 206 may optionally store additional modules and data structures not described above.
FIG. 3 is a flow diagram of a gesture detection and classification process 300 using image data 312, according to some embodiments. In some embodiments, gesture detection and classification process 300 is used to detect and identify gestures captured at a distance of at least 0.5 meters to 2 meters from the imaging device capturing image data 312. In some embodiments, gesture detection and classification process 300 is used to detect and identify gestures captured more than 2 meters from the imaging device capturing image data 312. The gesture detection and classification process 300 is optionally performed by one or more of the client devices 104, the server 102, and/or combinations thereof described above with reference to fig. 1 and 2. Gesture detection and classification process 300 includes a detection process 310, followed by a classification process 320, followed by a post-processing stage 330 that determines a final gesture 340.
In some embodiments, the detection process 310 includes applying a first detection and classification model 314 to the received image data 312. The first detection and classification model 314 generates one or more feature maps 302 for object detection in an object detection stage 316 and gesture classification in a gesture classification stage 318. The detection process 310 is used to provide a first output through the object detection stage 316 and a second output through the gesture classification stage 318 to determine the final gesture 340. The first output of the object detection stage 316 includes information of bounding boxes 303 and associated box confidence scores 304 for each first gesture 305. In some embodiments, the information of the bounding box 303 is used to generate a cropped image 322 that closely bounds the gesture, and a second classification network 324 is applied to determine the first gesture 305 from the cropped image 322. The second output of the gesture classification stage 318 includes information of the second gesture 307 and the associated second confidence score 308.
In some embodiments, the image data 312 is captured by the input device 210 (e.g., the RGB camera 260) of the electronic device 200 (fig. 2). The image data 312 is optionally processed locally at the electronic device 200. Alternatively, the image data 312 is uploaded to the server 102 or transmitted to a different electronic device 200. The different electronic device 200 obtains the image 312 from the electronic device 200 with the camera 260 or downloads the image 312 from the server 102 through the web browser module 222 and/or one or more user applications 224. In some embodiments, image data 312 includes one or more gestures. In some embodiments, the gesture in the image data 312 is at least 4 meters from the electronic device 200 capturing the image data 312. Non-limiting examples of the one or more gestures include hand gestures, facial actions, and body gestures. Image data 312 is received at an initial resolution (e.g., 1080p).
The image data 312 is passed through a first detection and classification model 314 to compress the image data 312 and/or to generate one or more feature maps 302 from the image data 312. In some embodiments, the image data 312 is processed (e.g., scaled down using one or more neural networks (e.g., one or more convolutional neural networks)) before passing through the first detection and classification model 314. The first detection and classification model 314 includes one or more machine learning models. For example, in some embodiments, the first detection and classification model 314 includes one or more convolutional neural networks (convolution neural network, CNN) known in the art. In some embodiments, one or more machine learning models are used to identify and enrich (e.g., extract details from) one or more features in the image data 312, narrow down (relative to an initial resolution of the image data 312) a feature resolution of the one or more features, and/or generate a sequence of (scaled) feature maps based on the image data 312. In some embodiments, one or more feature maps are provided as the output 302 of the first detection and classification model 314. Alternatively, in some embodiments, the sequence of feature maps is combined into a composite feature map that is provided as the output 302 of the first detection and classification model 314. The output 302 of the first detection and classification model 314 is used by at least the object detection stage 316 and the gesture classification stage 318.
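As a rough sketch of such a shared model, a backbone can produce feature maps that feed both an object detection head and a whole-image gesture classification head. The layer sizes, the single-box detection head, and the number of gesture classes below are arbitrary assumptions, not the patent's architecture.

```python
# A minimal sketch of a shared backbone with detection and classification heads;
# all sizes are illustrative assumptions.
import torch
import torch.nn as nn

NUM_GESTURES = 10  # hypothetical number of gesture classes

class DetectAndClassify(nn.Module):
    def __init__(self):
        super().__init__()
        # Downscale the input image and extract features (stand-in for any CNN backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.det_head = nn.Linear(128, 5)             # one box (x, y, w, h) plus a box confidence
        self.cls_head = nn.Linear(128, NUM_GESTURES)  # whole-image gesture vector

    def forward(self, image):
        fmap = self.backbone(image)                      # feature maps (302)
        pooled = fmap.mean(dim=(2, 3))                   # global average pooling
        box = self.det_head(pooled)                      # bounding box and box confidence (303, 304)
        gesture_vec = self.cls_head(pooled).softmax(-1)  # second gesture vector (307)
        return fmap, box, gesture_vec
```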
The gesture classification stage 318 identifies the second gesture 307 based on the output 302 of the first detection and classification model 314. In particular, the gesture classification stage 318 is operable to determine information of the second gesture 307 and an associated second confidence score 308 (i.e., Det Cls Conf Score) that is indicative of a level of confidence that the second gesture 307 was detected from one or more feature maps 302 corresponding to the entire image data 312. In some embodiments, the information of the second gesture 307 generated by the gesture classification stage 318 includes a second gesture vector. Each element of the second gesture vector corresponds to a respective gesture and represents a respective probability or confidence level that the second gesture 307 corresponds to the respective gesture. In some embodiments, the second gesture 307 and the second confidence score 308 are determined based on the second gesture vector. In some embodiments, the second gesture vector is normalized. In some embodiments, the gesture classification stage 318 is performed in parallel with the object detection stage 316 and the classification process 320.
In some embodiments, gesture classification stage 318 is used to identify gestures of a particular application and/or system. In some embodiments, the second gesture 307 determined by the gesture classification stage 318 includes only local information that may be used to further determine the final gesture 340. The local information of the gesture includes information specific to the body performing the gesture, information specific to the gesture (e.g., exact category), and/or information specific to a particular application and/or system. More specifically, the local information is information based only on the hand or part of the body performing the gesture. For example, as shown in fig. 3, the local information may be "scissors".
The object detection stage 316 is used to detect one or more gestures in the output of the first detection and classification model 314. In particular, the object detection stage 316 is used to generate one or more bounding boxes 303 around one or more gestures detected within the image data 312. In some embodiments, each bounding box 303 corresponds to a respective first gesture 305 and is associated with a box confidence score 304 (i.e., BBox Conf Score) that indicates a confidence level that the respective bounding box 303 is associated with the first gesture 305. Further, in some embodiments, the gesture area 322 is cropped and resized from the image data 312 for each first gesture 305 based on the information of the bounding box 303 (326).
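As an illustrative sketch of the crop-and-resize step (assuming a pixel-coordinate box (x1, y1, x2, y2) and an arbitrary target resolution):

```python
# A minimal sketch: crop the detected hand region from the full image and resize
# it to the input size expected by the second classification network. The box
# format and target size are assumptions.
import cv2
import numpy as np

def crop_and_resize(image: np.ndarray, box, out_size=(224, 224)) -> np.ndarray:
    x1, y1, x2, y2 = (int(v) for v in box)
    h, w = image.shape[:2]
    # Clamp the box to the image bounds before cropping.
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, out_size)
```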
The classification process 320 applies a second classification network 324 to each gesture area 322 (i.e., cropped image 322) to determine the first gesture 305. In some embodiments, the second classification network 324 is one or more neural networks known in the art (e.g., MobileNet, ShuffleNet). In some embodiments, the second classification network 324 is selected based on an intended gesture task (e.g., an intended gesture of a particular application and/or system) and/or based on the number of classified categories (e.g., the different types of gestures that may be classified by a particular application and/or system).
For each cropped image 322, the second classification network 324 determines the corresponding first gesture 305 by generating a first gesture vector for the cropped image 322. Each element of the first gesture vector determined by the second classification network 324 corresponds to a respective gesture and represents a respective first probability or confidence level that the gesture region 322 includes the respective gesture. In some embodiments, the classification process 320 determines the first gesture 305 and the first gesture confidence score 306 based on the first gesture vector. In some embodiments, the first gesture confidence score 306 of the first gesture 305 is combined with the box confidence score 304 determined by the object detection stage 316 to generate a first confidence score for the first gesture 305. The first gesture vector is normalized, which means that the sum of the probability values corresponding to all gestures is equal to 1.
The information of the first gesture 305 provided by the object detection stage 316 is used to determine whether the first gesture 305 is associated with contextual information. Such contextual information is used to determine whether to select the first gesture 305 or the second gesture 307 to determine the final gesture 340 associated with the image data 312. If the first gesture 305 is not associated with contextual information, the first gesture 305 is used to determine the final gesture 340. Conversely, if the first gesture 305 is associated with contextual information, the second gesture 307 is used to determine the final gesture 340. In an example, the first gesture 305 or the second gesture 307 is used to distinguish between a gesture performed near the user's face (e.g., lifting a finger onto the lips to represent silence) and a gesture performed in space (e.g., lifting a finger), respectively. Examples of contextual information include, but are not limited to, gestures performed on and/or near a particular portion of the body, gestures performed in a particular environment (e.g., at home, at an office, on a bus, in a library, etc.), previous gestures performed (e.g., a "pick up" gesture performed prior to a "pick up" gesture), and/or gestures performing surrounding movements (e.g., pinch-in gestures and/or expand gestures to zoom in and/or out).
The first gesture 305 and the second gesture 307 are determined from the gesture area 322 and the entire image 312, respectively, and are each used to determine the final gesture 340 through the post-processing stage 330. In the post-processing stage 330, the outputs of the detection process 310 and the classification process 320 are used together to determine the final gesture 340. The post-processing stage 330 includes one or more filters applied to the first gesture 305 and the second gesture 307 and the associated confidence scores 304, 306, and 308. The filter function is optionally used to identify false positives based on temporal information from previous states. In some embodiments, the filter is represented as a function of time, where F denotes the filter function. In some embodiments, in the post-processing stage 330, a selected one of the first gesture 305 and the second gesture 307 needs to stabilize for at least a predefined number of consecutive frames before being selected as the final gesture 340.
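The exact form of the filter is not reproduced here. As an illustration only (an assumption, not the patent's equation), such a temporal filter might take the general form

$$K_{\text{final}}(t) = F\big(K(t), K(t-1), \ldots, K(t-n)\big),$$

where K(t) denotes the per-frame gesture result at frame t and n is the number of previous frames considered.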
Any number of methods may be used to construct the filter function. In some embodiments, a filter function is used to smooth the output to avoid jitter and provide a smooth user experience. Non-limiting examples of filters and filtering techniques include moving/weighted average smoothing (or convolution), Fourier filtering, Kalman filters, and variations thereof. Although the filtering process described above is described with respect to gestures (e.g., Kcls and Kdet), similar filters and filtering techniques may be applied to smooth the bounding boxes.
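A minimal sketch of this idea, assuming a simple consecutive-frame stability rule rather than any particular convolution, Fourier, or Kalman formulation:

```python
# A minimal sketch of temporal filtering: keep a short history of per-frame
# results and report a gesture only once it has been stable for a predefined
# number of consecutive frames. Window size and stability count are assumptions.
from collections import deque

class GestureFilter:
    def __init__(self, window: int = 5, min_stable: int = 3):
        self.history = deque(maxlen=window)   # recent final-gesture candidates
        self.min_stable = min_stable

    def update(self, gesture):
        self.history.append(gesture)
        recent = list(self.history)[-self.min_stable:]
        # Report the gesture only if the last `min_stable` frames agree on it.
        if len(recent) == self.min_stable and len(set(recent)) == 1 and recent[0] is not None:
            return recent[0]
        return None  # treat as "no gesture" until the result stabilizes
```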
Further details of determining the final gesture 340 based on the first gesture 305 and the second gesture 307 and the associated confidence scores 304, 306, and 308 are discussed with reference to fig. 4. Note that fig. 3 illustrates an example process 300 of determining gestures, and that gesture detection and classification process 300 may be similarly applied to detect one or more of facial gestures, arm gestures, body gestures, and/or other sortable gestures performed by a user.
FIG. 4 is a flowchart of an example post-processing technique 400 to determine a final gesture 340 from a first gesture 305 and a second gesture 307 determined from image data 312, according to some embodiments. Post-processing technique 400 is an embodiment of one or more processes performed by post-processing stage 330 described above with reference to fig. 3. The post-processing technique 400 shows two branches for determining the final gesture 340—the left branch based on the second gesture 307 (Kdet), the second gesture confidence score 308 associated with the second gesture 307 (DetClsi), and the box confidence score 304 (Pbox) associated with the bounding box 303 of the first gesture 305, and the right branch based on the first gesture 305 (Kcls) and the first gesture confidence score 306 (Clsi), as described above with reference to fig. 3.
Beginning at operation 410, the post-processing technique 400 obtains the first gesture 305 (Kcls) and the first gesture confidence score 306 (Clsi) using the object detection stage 316. At operation 420, the post-processing technique 400 determines whether the first gesture confidence score 306 (Clsi) of the first gesture 305 (Kcls) is greater than or equal to the second threshold probability (P2). In some embodiments, the second threshold probability P2 is at least 0.10. In some embodiments, the second threshold probability P2 is any probability (e.g., at least 0.15) defined by the user and/or the system implementing the gesture detection and classification process 300. In some embodiments, the second threshold probability P2 is adjusted to obtain optimal performance and its value is highly dependent on the accuracy of the detection process 310 (e.g., including the classification stage 318).
If the first gesture confidence score 306 (Clsi) is below the second threshold probability P2 ("NO" at operation 420), then the corresponding first gesture 305 (Kcls) is determined to be an invalid gesture of the image (i.e., no gesture 480 is detected). Conversely, if the first gesture confidence score 306 (Clsi) is greater than or equal to the second threshold probability P2 ("yes" at operation 420), then the corresponding first gesture 305 (Kcls) remains a candidate gesture for the final gesture 340 and is used at operation 430. At operation 430, the post-processing technique 400 determines whether the first gesture 305 (Kcls) is contextually relevant (i.e., whether the first gesture 305 is associated with contextual information). If the first gesture 305 (Kcls) is not context-dependent ("no" at operation 430), then the post-processing technique 400 determines that the first gesture 305 (Kcls) is a gesture category 490 (i.e., the final gesture 340). Alternatively, if the first gesture 305 (Kcls) is contextually relevant ("yes" at operation 430), the post-processing technique 400 proceeds to operation 460 and utilizes the first gesture 305 (Kcls) in conjunction with the second gesture 307 to determine the final gesture 340 (where the first gesture 305 is focused on the gesture area 322 and the second gesture 307 is based on the entire image).
Turning to operation 440, the post-processing technique 400 obtains the second gesture 307 (Kdet), the second gesture confidence score 308 (DetClsi) associated with the second gesture 307, and the box confidence score 304 (Pbox). At operation 450, it is determined whether the second gesture confidence score 308 (DetClsi) of the second gesture 307 is greater than or equal to the first threshold probability (P1). In some embodiments, the first threshold probability P1 is at least 0.10. In some embodiments, the first threshold probability P1 is any probability (e.g., at least 0.15) defined by the user and/or the system implementing the gesture detection and classification process 300. In some embodiments, the first threshold probability P1 is adjusted to obtain optimal performance and its value is highly dependent on the accuracy of the detection process 310 and the classification process 320.
If the second gesture confidence score 308 (DetClsi) of the second gesture 307 (Kdet) is below the first threshold probability P1 (NO at operation 450), then it is determined that the second gesture 307 (Kdet) is an invalid gesture of the image (i.e., the gesture 480 is not detected). Conversely, if the second gesture confidence score 308 (DetClsi) of the second gesture 307 (Kdet) is greater than or equal to the first threshold probability P1 (yes at operation 450), then the second gesture 307 (Kdet) remains as a candidate gesture for the final gesture 340 and is used at operation 460.
In operation 460, the post-processing technique 400 determines whether the second gesture 307 (Kdet) is the same as the first gesture 305. If the second gesture 307 (Kdet) and the first gesture 305 (Kcls) are different ("no" at operation 460), then the post-processing technique 400 determines that the potential gesture is invalid (i.e., the gesture 480 is not detected). Alternatively, if the second gesture 307 (Kdet) and the first gesture 305 (Kcls) are the same (yes at operation 460), the post-processing technique 400 proceeds to operation 470 and determines whether the third confidence score 402 is greater than a third threshold probability (P3). The third confidence score 402 is equal to the second gesture confidence score 308 (DetClsi) of the second gesture 307 (Kdet) multiplied by the box confidence score 304 (Pbox). In fig. 4, the third confidence score 402 is represented as DetClsKdet × Pbox. In some embodiments, the third threshold probability P3 is at least 0.10. In some embodiments, the third threshold probability P3 is any probability (e.g., at least 0.15) defined by the user and/or the system implementing the gesture detection and classification process 300. In some embodiments, the third threshold probability P3 is adjusted for optimal performance and its value is highly dependent on the accuracy of the detection process 310 and the classification process 320.
If the third confidence score 402 is less than the third threshold probability P3 ("NO" of operation 470), then the second gesture 307 (Kdet) is determined to be invalid (i.e., the gesture 480 is not detected). If the third confidence score 402 is greater than the third threshold probability P3 ("Yes" at operation 470), the post-processing technique 400 determines that the second gesture 307 (Kdet) is the gesture category 490 (i.e., the final gesture 340). By combining the information provided by the detection process 310 and the classification process 320, it may be determined whether the gesture is contextually relevant (rather than relying solely on the classification process). This overall approach in the above-described techniques improves the ability of the electronic device to detect and recognize gestures as compared to existing solutions.
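Putting the branches of FIG. 4 together, a sketch of the decision logic looks as follows; the threshold values and the set of contextual gestures are illustrative assumptions.

```python
# A minimal sketch of the post-processing flow in FIG. 4; threshold values and
# the contextual-gesture set are illustrative, not the patent's.
P1, P2, P3 = 0.10, 0.10, 0.10                 # example thresholds, tuned per system
CONTEXTUAL_GESTURES = {"finger_to_lips"}      # hypothetical contextual gestures

def final_gesture(k_cls, cls_conf, k_det, det_cls_conf, p_box):
    # Operation 420: gate the cropped-region gesture (Kcls) on its confidence (Clsi).
    if cls_conf < P2:
        return None                           # no gesture detected
    # Operation 430: non-contextual gestures are decided from the crop alone.
    if k_cls not in CONTEXTUAL_GESTURES:
        return k_cls
    # Operation 450: gate the whole-image gesture (Kdet) on its confidence (DetClsi).
    if det_cls_conf < P1:
        return None
    # Operation 460: the two branches must agree on the same gesture.
    if k_det != k_cls:
        return None
    # Operation 470: integrated confidence DetClsi * Pbox must exceed P3.
    if det_cls_conf * p_box <= P3:
        return None
    return k_det
```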
FIG. 5 is a flow diagram of a method 500 of classifying one or more gestures, according to some embodiments. Method 500 includes one or more of the operations described above with reference to fig. 3 and 4. Method 500 provides a solution for gesture recognition (e.g., hand gestures, facial actions, arm gestures, etc.) across different electronic devices and/or systems (e.g., as described above with reference to fig. 1 and 2). The gesture determination method 500 improves the accuracy of local gesture classification (e.g., context-independent gesture classification) and contextual gesture classification (e.g., context-based gesture classification) relative to existing solutions. For example, in some embodiments, the gesture determination process 500 demonstrates an increase in accuracy of the local gesture classification of at least 4-10% compared to existing solutions, and an increase in accuracy of the contextual gesture classification of at least 43-120% compared to existing solutions.
The operations (e.g., steps) of method 500 are performed by one or more processors (e.g., CPU 202; fig. 2) of an electronic device (e.g., at server 102 and/or client device 104). At least some of the operations shown in fig. 5 correspond to instructions stored in a computer memory or computer-readable storage medium (e.g., memory 206; fig. 2). Operations 502-512 may also be performed in part using one or more processors and/or using instructions stored in a memory or computer-readable medium of one or more devices communicatively coupled together, such as a notebook computer, AR glasses, or other head mounted display, server, tablet, security camera, drone, smart television, smart speaker, toy, smart watch, smart appliance, or other computing device that may perform operations 502-512 alone or in combination with a corresponding processor of communicatively coupled electronic device 200.
The method 500 includes obtaining (502) an image 312 including a hand region 322, detecting (504) the hand region 322 in the image 312, and determining (506) a first gesture 305 from the hand region of the image. For example, as described above with reference to FIG. 3, image data 312 is obtained and (after image data 312 is processed by first detection and classification model 314) a detection process 310 (available for cropping images) and a classification process 320 are applied to image data 312 to determine a classification gesture vector for a cropped image 322 (i.e., hand region 322) fused with context information.
In some embodiments, determining the first gesture 305 from the hand region 322 of the image 312 includes generating a first gesture vector from the hand region 322 of the image, each element of the first gesture vector corresponding to a respective gesture and representing a respective first confidence level of the hand region 322 including the respective gesture, determining the first gesture 305 and the first gesture confidence score 306 from the first gesture vector. For example, as described above with reference to FIG. 3, the object detection stage 316 is applied to detect one or more gestures of the image data 312 (after passing through the first detection and classification model 314), which are cropped and used to determine a first gesture vector. In some embodiments, the method 500 further includes associating the detection of the hand region 322 in the image 312 with the bounding box confidence score 304, and combining the bounding box confidence score 304 with the first gesture confidence score 306 (Clsi) of the first gesture 305. For example, as shown above with reference to fig. 3, the output of the classification process 320 may be combined with the output of the object detection stage 316 (e.g., bounding box confidence score).
In some embodiments, the first gesture 305 includes a respective gesture corresponding to a maximum first confidence level of the respective first confidence levels of each element of the first gesture vector, and the gesture confidence score is equal to the maximum first confidence level of the respective first confidence levels of each element of the first gesture vector. In other words, the first gesture 305 may have a greater confidence score (e.g., the first confidence score 306) than other gestures in a corresponding set of one or more gestures (e.g., the first set of one or more gestures described above with reference to fig. 3).
The method 500 includes determining (508) a second gesture 307 from the image (e.g., the entire image). In some embodiments, determining the second gesture 307 from the image includes generating a second gesture vector from the image (e.g., the entire image), each element of the second gesture vector corresponding to a respective gesture and representing a respective second confidence level of the image including the respective gesture, and determining the second gesture 307 and the second gesture confidence score 308 from the second gesture vector.
In some embodiments, the second gesture 307 comprises a respective gesture corresponding to a maximum second confidence level of the respective second confidence levels of each element of the second gesture vector, and the second gesture confidence score 308 is equal to the maximum second confidence level of the respective second confidence levels of each element of the second gesture vector. In other words, the second gesture 307 may have a greater confidence score (e.g., the second confidence score 308) than other gestures in a corresponding set of one or more gestures (e.g., the second set of one or more gestures described above with reference to fig. 3).
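The whole-image branch can reuse the same hypothetical helper; again, the names follow the earlier sketches and are assumptions.

# Second gesture and its confidence from the whole-image gesture vector.
second_gesture, second_confidence = select_gesture(detection.context_scores)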
The method 500 includes determining (510) that a final gesture of the image is the first gesture 305 in accordance with a determination that the first gesture 305 is not any gesture of the plurality of contextual gestures. For example, as shown above with reference to fig. 4, in accordance with a determination that the gesture Kcls (e.g., the first gesture 305) is not contextually relevant ("no" at operation 430), the gesture Kcls is determined to be the final gesture 340. In some embodiments, the method 500 includes, prior to determining whether the first gesture 305 is at least one gesture of the plurality of contextual gestures, determining whether the first gesture confidence score 306 is greater than a second threshold P2, and in accordance with a determination that the first gesture confidence score 306 is less than the second threshold P2, determining that the image is not associated with any gesture. For example, as described above with reference to fig. 4, in accordance with a determination that the confidence score of the gesture Kcls (e.g., Cls_Kcls) is less than the second threshold probability P2 ("no" at operation 420), the method 500 determines that no gesture is present.
In some embodiments, the method 500 includes, prior to determining whether the first gesture 305 is at least one gesture of the plurality of contextual gestures, determining whether the second gesture confidence score 308 (DetCls_i) of the second gesture 307 is greater than a first threshold P1, and in accordance with a determination that the second gesture confidence score 308 (DetCls_i) of the second gesture 307 is less than the first threshold P1, determining that the image is not associated with any gesture. For example, as described above with reference to fig. 4, in accordance with a determination that the confidence score of the gesture Kdet (e.g., DetCls_Kdet) is less than the first threshold probability P1 ("no" at operation 450), the method 500 determines that no gesture is present.
The method 500 further includes, in accordance with a determination that the first gesture 305 is one of a plurality of contextual gestures, determining (512) a final gesture based on the second gesture 307 and the second gesture confidence score 308, both of which are associated with the image (e.g., the entire image). In some embodiments, determining the final gesture 340 based on the second gesture 307 and the second gesture confidence score 308 further includes, in accordance with a determination that the first gesture 305 and the second gesture 307 are different from each other, determining that the image 312 is not associated with any gesture. For example, as described above with reference to fig. 4, in accordance with a determination that the gesture Kdet is different from the gesture Kcls ("yes" at operation 460), the method 500 determines that no gesture is present.
Alternatively, in some embodiments, in accordance with a determination that the first gesture 305 and the second gesture 307 are the same as each other and in accordance with a determination that the third confidence score 402 does not exceed the integrated confidence threshold (e.g., P3 in fig. 4), the method 500 includes determining that the image is not associated with any gesture. For example, as described above with reference to fig. 4, in accordance with a determination that the third confidence score 402 (e.g., the confidence score DetCls_Kdet of the gesture Kdet multiplied by the bounding box confidence score 304 (Pbox), the output of the object detection stage 316; fig. 3) is less than the third threshold probability P3 ("no" at operation 470), the method 500 determines that no gesture is present. In some embodiments, in accordance with a determination that the first gesture 305 and the second gesture 307 are the same as each other and in accordance with a determination that the third confidence score 402 exceeds the integrated confidence threshold P3, the method 500 includes determining that the final gesture 340 is the second gesture 307. For example, as described above with reference to fig. 4, in accordance with a determination that the third confidence score 402 is greater than or equal to the third threshold probability P3 ("yes" at operation 470), the method 500 determines the second gesture (Kdet) to be the final gesture 340.
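Taken together, operations 420-470 amount to a compact decision procedure. The sketch below is a hedged reconstruction of that control flow from the description above: the threshold values, the exact ordering of the confidence checks relative to the context check, and the multiplicative form of the third confidence score are assumptions where the text leaves them open.

NO_GESTURE = None

def decide_final_gesture(k_cls, cls_score,          # first gesture and its confidence
                         k_det, det_score,          # second gesture and its confidence
                         box_confidence,            # bounding box confidence (Pbox)
                         contextual_gestures,       # the plurality of contextual gestures
                         p1=0.5, p2=0.5, p3=0.25):  # thresholds P1, P2, P3 (assumed values)
    """Hedged reconstruction of the decision flow of operations 420-470."""
    if cls_score < p2:                              # operation 420: weak local confidence
        return NO_GESTURE
    if k_cls not in contextual_gestures:            # operation 430: not a contextual gesture
        return k_cls                                # final gesture is the first gesture
    if det_score < p1:                              # operation 450: weak contextual confidence
        return NO_GESTURE
    if k_det != k_cls:                              # operation 460: the two branches disagree
        return NO_GESTURE
    third_confidence = det_score * box_confidence   # third confidence score
    if third_confidence < p3:                       # operation 470: weak combined confidence
        return NO_GESTURE
    return k_det                                    # final gesture is the second gesture

Under these assumptions, a call such as decide_final_gesture(first_gesture, first_confidence, second_gesture, second_confidence, detection.box_confidence, contextual_gestures) yields the final gesture (or NO_GESTURE) for the frame.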
In some embodiments, the first threshold P1, the second threshold P2, and the integrated confidence threshold P3 are tuned to obtain optimal performance, and their values depend strongly on the accuracy of the detection process 310 and the classification process 320.
In some embodiments, the method 500 further includes filtering the final gesture using a filtering function. The filtering function is used to identify false positives. In some embodiments, the filtering function is one of a convolution function (e.g., a moving and/or weighted average smoothing function), a Fourier filter function, or a Kalman filter. In some embodiments, the filtering function is a function of time that requires the determination of the first gesture and/or the second gesture to be stable (e.g., stable for at least 5 consecutive frames). In some embodiments, the filtering function smooths the bounding boxes and the identified gesture categories, helps avoid jitter and loss of detection, and makes downstream applications easier to engineer (e.g., implementing gesture control for volume adjustment). Additional information about the filtering function is provided above with reference to fig. 3.
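As one purely illustrative possibility, the temporal filtering could combine a moving-average smoother over the bounding box with an N-frame stability requirement on the predicted class. The window length, the five-frame stability rule, and all names below are assumptions consistent with, but not mandated by, the description.

from collections import deque

class GestureSmoother:
    """Moving-average box smoothing plus an N-frame class-stability check (illustrative)."""
    def __init__(self, window: int = 5):
        self.boxes = deque(maxlen=window)
        self.labels = deque(maxlen=window)

    def update(self, box, label):
        self.boxes.append(box)
        self.labels.append(label)
        # Average each box coordinate over the window to suppress jitter.
        smoothed_box = tuple(sum(c) / len(c) for c in zip(*self.boxes))
        # Report a gesture only once it has been stable for the whole window.
        stable = len(self.labels) == self.labels.maxlen and len(set(self.labels)) == 1
        return smoothed_box, (label if stable else None)

In a volume-control example, the smoothed label stream would drive the volume only while the same gesture persists for the full window, which suppresses spurious adjustments from single-frame false positives.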
It should be understood that the particular order of the operations described in fig. 5 is merely exemplary and is not intended to indicate that the described order is the only order in which the operations may be performed. Those of ordinary skill in the art will recognize various ways of recognizing gestures as described herein. Additionally, it should be noted that the details of the other processes described above with respect to fig. 3 and 4 also apply in a similar manner to the method 500 described above with respect to fig. 5. For brevity, these details are not repeated here.
The terminology used in the description of the various embodiments described herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various embodiments described and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. In addition, it will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
As used herein, the term "if" is optionally construed to mean "when" or "upon" or "in response to determining" or "in response to detecting" or "in accordance with a determination that," depending on the context. Similarly, the phrase "if it is determined" or "if [a stated condition or event] is detected" is optionally construed to mean "upon determining" or "in response to determining" or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]" or "in accordance with a determination that [the stated condition or event] is detected," depending on the context.
The foregoing description, for purposes of explanation, has been presented with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in light of the above teachings. The embodiments were chosen and described in order to best explain the principles of operation and their practical application, thereby enabling others skilled in the art to practice the described embodiments.
Although the various figures show a plurality of logic stages in a particular order, stages that are not order dependent may be reordered, and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, other orderings and groupings will be apparent to those of ordinary skill in the art, and thus the orderings and groupings presented herein are not an exhaustive list of alternatives. Furthermore, it should be appreciated that these stages may be implemented in hardware, firmware, software, or any combination thereof.

Claims (14)

1. A gesture classification method implemented by an electronic device, the method comprising:
Obtaining an image comprising a hand region;
detecting the hand region in the image;
determining a second gesture from the image;
determining a first gesture from the hand region of the image;
In accordance with a determination that the first gesture is not any of a plurality of contextual gestures, determining that a final gesture of the image is the first gesture, and
In accordance with a determination that the first gesture is one of the plurality of contextual gestures, determining the final gesture based on the second gesture and a second gesture confidence score, the second gesture and the second gesture confidence score being associated with the image.
2. The method of claim 1, wherein determining the second gesture from the image further comprises:
Generating a second gesture vector from the image, each element of the second gesture vector corresponding to a respective gesture and representing a respective second confidence level of the image including the respective gesture, and
determining the second gesture and the second gesture confidence score from the second gesture vector.
3. The method of claim 2, wherein the second gesture comprises a respective gesture corresponding to a maximum second confidence level of the respective second confidence levels of each element of the second gesture vector, and the second gesture confidence score is equal to the maximum second confidence level of the respective second confidence levels of each element of the second gesture vector.
4. A method according to any one of claims 1-3, further comprising:
determining, prior to determining whether the first gesture is at least one gesture of the plurality of contextual gestures, whether a first gesture confidence score is greater than a second threshold; and
in accordance with a determination that the first gesture confidence score is less than the second threshold, determining that the image is not associated with any gesture.
5. The method of any of the preceding claims, wherein determining the first gesture from the hand region of the image further comprises:
generating a first gesture vector from the hand region of the image, each element of the first gesture vector corresponding to a respective gesture and representing a respective first confidence level of the hand region including the respective gesture; and
determining the first gesture and a first gesture confidence score from the first gesture vector.
6. The method of claim 5, further comprising:
Correlating the detection of the hand region in the image with a bounding box confidence score; and
combining the bounding box confidence score with a confidence score associated with the first gesture to generate the first gesture confidence score.
7. The method of claim 5 or 6, wherein the first gesture comprises a respective gesture corresponding to a maximum first confidence level of the respective first confidence levels of each element of the first gesture vector, and the first gesture confidence score is equal to the maximum first confidence level of the respective first confidence levels of each element of the first gesture vector.
8. The method of any of claims 5-7, further comprising:
Determining, prior to determining whether the first gesture is at least one gesture of the plurality of contextual gestures, whether the second gesture confidence score of the second gesture is greater than a first threshold; and
in accordance with a determination that the second gesture confidence score of the second gesture is less than the first threshold, determining that the image is not associated with any gesture.
9. The method of any of the preceding claims, wherein determining the final gesture based on the second gesture and the second gesture confidence score further comprises:
in accordance with a determination that the first gesture and the second gesture are different from each other, determining that the image is not associated with any gesture, and
In accordance with a determination that the first gesture and the second gesture are the same as each other:
In accordance with a determination that a third confidence score exceeds an integrated confidence threshold, determining that the final gesture is the second gesture, and
In accordance with a determination that the third confidence score does not exceed the integrated confidence threshold, determining that the image is not associated with any gesture.
10. The method of any of the preceding claims, further comprising filtering the final gesture using a filter function, wherein the filter function is used to identify false positives.
11. The method of claim 10, wherein the filter function is one of a convolution function, a fourier filter function, or a kalman filter.
12. The method of claim 10 or 11, wherein the filter function is a function of time.
13. An electronic device, comprising:
one or more processors, and
A memory having instructions stored thereon that, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-12.
14. A non-transitory computer-readable medium having instructions stored thereon, which when executed by one or more processors, cause the one or more processors to perform the method of any of claims 1-12.