Detailed Description
Reference will now be made in detail to the specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to provide an understanding of the subject matter presented herein. It will be apparent, however, to one skilled in the art that various alternatives may be used without departing from the scope of the claims, and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein may be implemented on many types of electronic devices having digital video capabilities.
FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, according to some embodiments. The one or more client devices 104 may be, for example, a desktop computer 104A, a tablet computer 104B, a mobile phone 104C, a head-mounted display (HMD) 104D (also referred to as augmented reality (AR) glasses), or a smart multi-sensor networked home device (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 may collect data or user input, execute user applications, and present output on its user interface. The collected data or user input may be processed locally at the client device 104 and/or remotely by the server 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104 and, in some embodiments, process data and user inputs received from the client devices 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 also includes a memory 106 for storing data related to the server 102, the client device 104, and applications executing on the client device 104.
The one or more servers 102 are operable to enable real-time data communication with client devices 104 that are remote from each other or remote from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to perform data processing tasks that cannot be accomplished locally by, or that would not be prioritized by, the client device 104. For example, the client device 104 includes a game console (e.g., the HMD 104D) executing an interactive online game application. The game console receives user instructions and sends them to the game server 102 along with user data. The game server 102 generates a video data stream based on the user instructions and user data and provides the video data stream for display on the game console and other client devices that participate in the same game session with the game console. In another example, the client device 104 includes a networked monitoring camera 104E and a mobile phone 104C. The networked monitoring camera 104E collects video data and transmits the video data to the monitoring camera server 102 in real time. Although the video data is optionally pre-processed by the monitoring camera 104E, the monitoring camera server 102 processes the video data to identify motion or audio events in the video data and shares information about those events with the mobile phone 104C, thereby allowing the user of the mobile phone 104C to monitor, remotely and in real time, events occurring in the vicinity of the networked monitoring camera 104E.
The one or more servers 102, the one or more client devices 104, and the memory 106 are communicatively coupled to one another via one or more communication networks 108, which are the media used to provide communication links between these devices and other computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 may alternatively be implemented using any known network protocol, including various wired or wireless protocols such as Ethernet, universal serial bus (USB), FireWire, long term evolution (LTE), global system for mobile communications (GSM), enhanced data GSM environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet protocol (VoIP), WiMAX, or any other suitable communication protocol. Connections to the one or more communication networks 108 may be established directly (e.g., using 3G/4G connections to wireless carriers), through a network interface 110 (e.g., a router, switch, gateway, hub, or intelligent private home control node), or through any combination thereof. As such, the one or more communication networks 108 may represent the Internet, a worldwide collection of networks and gateways that use the transmission control protocol/Internet protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational, and other computer systems that route data and messages.
In some embodiments, deep learning techniques are applied to the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executing at the client device 104, to identify information contained in the content data, to match the content data with other data, to classify the content data, or to synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensors of the client device 104. In these deep learning techniques, a data processing model is created based on one or more neural networks to process content data. These data processing models are trained with training data before being applied to process content data. After model training, the mobile phone 104C or HMD 104D obtains content data (e.g., captures video data through an internal camera) and processes the content data locally using a data processing model.
In some embodiments, model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and the HMD 104D). The client device 104 obtains training data from one or more servers 102 or the memory 106 and applies the training data to train the data processing model. Alternatively, in some embodiments, model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and the HMD 104D). The server 102A obtains training data from itself, another server 102, or the memory 106 and applies the training data to train the data processing model. The client device 104 obtains content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing model, receives data processing results (e.g., recognized gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the gestures, or implements some other function based on the results. The client device 104 itself performs little or no processing of the content data before sending the content data to the server 102A. Further, in some embodiments, data processing is implemented locally at the client device 104 (e.g., the client device 104B and the HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains training data from itself, another server 102, or the memory 106 and applies the training data to train the data processing model. The trained data processing model is optionally stored at the server 102B or the memory 106. The client device 104 imports the trained data processing model from the server 102B or the memory 106, processes content data using the data processing model, and generates data processing results to be presented on a user interface or used to locally launch functions (e.g., rendering virtual objects based on device gestures).
In some embodiments, a pair of AR glasses 104D (also referred to as an HMD) is communicatively coupled in the data processing environment 100. The AR glasses 104D include a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are used to capture video and audio data from the scene in which the AR glasses 104D are located, while the one or more inertial sensors are used to capture inertial sensor data. In some cases, the camera captures gestures of the user wearing the AR glasses 104D, and the gestures are recognized locally in real time using a gesture recognition model. In some cases, the microphone records ambient sounds, including voice commands of the user. In some cases, both video or still visual data captured by the camera and inertial sensor data measured by the one or more inertial sensors are used to determine and predict device pose. The video, still images, audio, or inertial sensor data captured by the AR glasses 104D are processed by the AR glasses 104D, the server 102, or both to recognize device gestures. Optionally, the server 102 and the AR glasses 104D together apply deep learning techniques to recognize and predict device gestures. The device gestures are used to control the AR glasses 104D itself or to interact with applications (e.g., gaming applications) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device gestures are used to render or interact with a user-selectable display item (e.g., an avatar) on the user interface.
As described above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, still image data, or inertial sensor data captured by the AR glasses 104D. A 2D or 3D device pose is identified and predicted based on such video, still image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first data processing model and the second data processing model is optionally performed by the server 102 or the AR glasses 104D. The inference of device gestures and visual content is performed by the server 102 and the AR glasses 104D each independently, or by the server 102 and the AR glasses 104D together.
FIG. 2 is a block diagram illustrating an electronic device 200 for processing content data (e.g., image data), in accordance with some embodiments. The electronic device 200 is one of the server 102, the client device 104 (e.g., the AR glasses 104D in fig. 1), the memory 106, or a combination thereof. In an example, the electronic device 200 is a mobile device that includes a gesture recognition module 230, which applies a neural network model (e.g., in fig. 3) end-to-end to locally recognize gestures at the mobile device. The electronic device 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic device 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice command input unit or microphone, a touch screen display, a touch-sensitive tablet, a gesture-capturing camera, or other input buttons or controls. Further, in some embodiments, the electronic device 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the electronic device 200 includes one or more optical cameras (e.g., an RGB camera 260), scanners, or light sensor units for capturing images of, for example, a graphical serial code printed on an electronic device. In some embodiments, the electronic device 200 also includes one or more output devices 212 for presenting user interfaces and displaying content, including one or more speakers and/or one or more visual displays. Optionally, the electronic device 200 includes a location detection device, such as a GPS (global positioning system) receiver or other geographic location receiver, for determining the location of the electronic device 200. Optionally, the electronic device 200 includes an inertial measurement unit (IMU) 280 that integrates sensor data captured by multi-axis inertial sensors to estimate the position and orientation of the electronic device 200 in space. Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
Alternatively or additionally, in some embodiments, the electronic device 200 is communicatively coupled, via the one or more network interfaces 204, to one or more devices (e.g., the server 102, the client device 104, the memory 106, or a combination thereof) that include one or more of the input devices 210, output devices 212, IMU 280, or other components described above and that provide data to the electronic device 200.
Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices, and optionally includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Optionally, the memory 206 includes one or more storage devices remote from the one or more processing units 202. The memory 206, or alternatively the non-volatile memory within the memory 206, includes a non-transitory computer-readable storage medium. In some embodiments, the memory 206, or the non-transitory computer-readable storage medium of the memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
an operating system 214 including processes for handling various basic system services and for performing hardware-related tasks;
a network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., the server 102, the client device 104, or the memory 106) via one or more (wired or wireless) network interfaces 204 and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, etc.;
a user interface module 218 for enabling presentation of information (e.g., graphical user interfaces of applications 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., display, speaker, etc.);
an input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected inputs or interactions;
a web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and their web pages, including a network interface for logging into a user account associated with the client device 104 or another electronic device, for controlling the client or electronic device when associated with the user account, and for editing and viewing settings and data associated with the user account;
one or more user applications 224 executed by the electronic device 200 (e.g., games, social networking applications, smart home applications, and/or other web-based or non-web-based applications for controlling another electronic device and viewing data captured by such a device);
a model training module 226 for receiving training data and building a data processing model for processing content data (e.g., video, image, audio, or text data) to be collected or obtained by the client device 104;
a data processing module 228 for processing the content data using the data processing model 250 to identify information contained in the content data, match the content data with other data, classify the content data, or synthesize related content data, where, in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to user instructions received from the user application 224 (in an example, the data processing module 228 is applied to implement the gesture detection and classification process 300 in fig. 3);
a gesture recognition module 230 for classifying one or more gestures in the image (as shown and described below with reference to fig. 3 and 4), where the gesture recognition module 230 further includes a detection module 232 for detecting one or more objects in the image and/or a classification module 234 for classifying one or more gestures in a region or portion of the image and/or the entire image, the image data being processed jointly by the detection process 310 and the classification process 320 of the gesture recognition module 230 and the data processing module 228;
one or more databases 240 for storing at least data including one or more of:
device settings 242, including common device settings (e.g., service layer, device model, storage capacity, processing power, communication power, etc.) for one or more of server 102 or client device 104;
user account information 244 for one or more user applications 224, such as user names, security questions, account history data, user preferences, and predefined account settings;
network parameters 246 of the one or more communication networks 108, such as IP address, subnet mask, default gateway, DNS server, and hostname;
training data 248 for training one or more data processing models 250;
a data processing model 250 for processing content data (e.g., video, image, audio, or text data) using a deep learning technique, wherein the data processing model 250 includes an image compression model for implementing an image compression process, a feature extraction model for implementing a multi-scale feature extraction process, and/or one or more classification models and networks as described below with reference to fig. 3 and 4;
a gesture database 252 for storing one or more gestures associated with an image (e.g., stored in a database in the memory 206); and
content data and results 254 that are obtained by and output by, respectively, the electronic device 200 (or a device communicatively coupled to the electronic device 200 (e.g., the client device 104)), where the content data is processed by the data processing model 250 locally at the client device 104 or remotely at the server 102 to provide relevant results to be presented on the client device 104.
Optionally, one or more databases 240 are stored at one of the server 102, the client device 104, and the memory 106 of the electronic device 200. Optionally, one or more databases 240 are distributed among more than one of the server 102, client device 104, and memory 106 of the electronic device 200. In some embodiments, more than one copy of the above data is stored at different devices, e.g., two copies of the data processing model 250 are stored at the server 102 and the memory 106, respectively.
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices and correspond to a set of instructions for performing the above described functions. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments. In some embodiments, memory 206 optionally stores a subset of the modules and data structures identified above. Further, memory 206 may optionally store additional modules and data structures not described above.
FIG. 3 is a flow diagram of a gesture detection and classification process 300 using image data 312, according to some embodiments. In some embodiments, the gesture detection and classification process 300 is used to detect and identify gestures captured at a distance of at least 0.5 meters and up to 2 meters from the imaging device capturing the image data 312. In some embodiments, the gesture detection and classification process 300 is used to detect and identify gestures captured more than 2 meters from the imaging device capturing the image data 312. The gesture detection and classification process 300 is optionally performed by one or more of the client devices 104, the server 102, and/or combinations thereof described above with reference to fig. 1 and 2. The gesture detection and classification process 300 includes a detection process 310, followed by a classification process 320, followed by a post-processing stage 330 that determines a final gesture 340.
In some embodiments, the detection process 310 includes applying a first detection and classification model 314 to the received image data 312. The first detection and classification model 314 generates one or more feature maps 302 used for object detection in an object detection stage 316 and for gesture classification in a gesture classification stage 318. The detection process 310 is used to provide a first output through the object detection stage 316 and a second output through the gesture classification stage 318 to determine the final gesture 340. The first output of the object detection stage 316 includes information of a bounding box 303 and an associated box confidence score 304 for each first gesture 305. In some embodiments, the information of the bounding box 303 is used to generate a cropped image 322 that closely bounds the gesture, and a second classification network 324 is applied to determine the first gesture 305 from the cropped image 322. The second output of the gesture classification stage 318 includes information of a second gesture 307 and an associated second confidence score 308.
In some embodiments, the image data 312 is captured by an input device 210 (e.g., the RGB camera 260) of the electronic device 200 (fig. 2). The image data 312 is optionally processed locally at the electronic device 200. Alternatively, the image data 312 is uploaded to the server 102 or transmitted to a different electronic device 200. The different electronic device 200 obtains the image 312 from the electronic device 200 having the camera 260, or downloads the image 312 from the server 102 through the web browser module 222 and/or one or more user applications 224. In some embodiments, the image data 312 includes one or more gestures. In some embodiments, the gesture in the image data 312 is at least 4 meters from the electronic device 200 capturing the image data 312. Non-limiting examples of the one or more gestures include hand gestures, facial actions, and body gestures. The image data 312 is received at an initial resolution (e.g., 1080p).
The image data 312 is passed through the first detection and classification model 314 to compress the image data 312 and/or to generate one or more feature maps 302 from the image data 312. In some embodiments, the image data 312 is processed (e.g., scaled down using one or more neural networks (e.g., one or more convolutional neural networks)) before passing through the first detection and classification model 314. The first detection and classification model 314 includes one or more machine learning models. For example, in some embodiments, the first detection and classification model 314 includes one or more convolutional neural networks (CNNs) known in the art. In some embodiments, the one or more machine learning models are used to identify and enrich (e.g., extract details from) one or more features in the image data 312, reduce the feature resolution of the one or more features (relative to the initial resolution of the image data 312), and/or generate a sequence of (scaled) feature maps based on the image data 312. In some embodiments, the one or more feature maps are provided as the output 302 of the first detection and classification model 314. Alternatively, in some embodiments, the sequence of feature maps is combined into a composite feature map that is provided as the output 302 of the first detection and classification model 314. The output 302 of the first detection and classification model 314 is used by at least the object detection stage 316 and the gesture classification stage 318.
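For illustration only, the following sketch shows how a model in the spirit of the first detection and classification model 314 might produce a sequence of progressively scaled feature maps from an input frame. PyTorch is assumed, and the stage count, layer sizes, and the class name TinyBackbone are hypothetical choices, not details taken from the embodiments above.

```python
# Hypothetical sketch: a small convolutional backbone that downscales the
# input frame and returns a list of progressively coarser feature maps.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    def __init__(self, in_channels: int = 3, base_channels: int = 16):
        super().__init__()
        self.stages = nn.ModuleList()
        channels = in_channels
        for i in range(3):  # three stride-2 stages -> 1/2, 1/4, 1/8 resolution
            out_channels = base_channels * (2 ** i)
            self.stages.append(nn.Sequential(
                nn.Conv2d(channels, out_channels, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            ))
            channels = out_channels

    def forward(self, x: torch.Tensor) -> list:
        feature_maps = []
        for stage in self.stages:
            x = stage(x)
            feature_maps.append(x)  # one feature map per scale
        return feature_maps

# Example: a 1080p RGB frame produces three scaled feature maps.
frame = torch.randn(1, 3, 1080, 1920)
maps = TinyBackbone()(frame)
print([tuple(m.shape) for m in maps])
```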
The gesture classification stage 318 identifies the second gesture 307 based on the output 302 of the first detection and classification model 314. In particular, the gesture classification stage 318 is operable to determine information of the second gesture 307 and an associated second confidence score 308 (i.e., Det Cls Conf Score) that indicates a level of confidence that the second gesture 307 is detected from the one or more feature maps 302 corresponding to the entire image data 312. In some embodiments, the information of the second gesture 307 generated by the gesture classification stage 318 includes a second gesture vector. Each element of the second gesture vector corresponds to a respective gesture and represents a respective probability or confidence level that the second gesture 307 corresponds to the respective gesture. In some embodiments, the second gesture 307 and the second confidence score 308 are determined based on the second gesture vector. In some embodiments, the second gesture vector is normalized. In some embodiments, the gesture classification stage 318 is performed in parallel with the object detection stage 316 and the classification process 320.
In some embodiments, gesture classification stage 318 is used to identify gestures of a particular application and/or system. In some embodiments, the second gesture 307 determined by the gesture classification stage 318 includes only local information that may be used to further determine the final gesture 340. The local information of the gesture includes information specific to the body performing the gesture, information specific to the gesture (e.g., exact category), and/or information specific to a particular application and/or system. More specifically, the local information is information based only on the hand or part of the body performing the gesture. For example, as shown in fig. 3, the local information may be "scissors".
The object detection stage 316 is used to detect one or more gestures in the output of the first detection and classification model 314. In particular, the object detection stage 316 is used to generate one or more bounding boxes 303 around one or more gestures detected within the image data 312. In some embodiments, each bounding box 303 corresponds to a respective first gesture 305 and is determined using a box confidence score 304 (i.e., BBox Conf Score) that indicates a confidence level that the respective bounding box 303 is associated with the first gesture 305. Further, in some embodiments, the gesture area 322 is cropped and resized (326) from the image data 312 for each first gesture 305 based on the information of the bounding box 303.
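As an illustration of the crop-and-resize step (326), the sketch below clamps a padded bounding box to the frame and resizes the crop for a downstream classifier such as the second classification network 324. OpenCV and NumPy are assumed, and the 224x224 output size and 10% margin are arbitrary illustrative values, not parameters specified above.

```python
# Hypothetical sketch of step 326: crop the gesture area 322 from the frame
# using a detected bounding box 303, then resize it for classification.
import cv2
import numpy as np

def crop_gesture_area(image: np.ndarray, box, output_size=(224, 224), margin: float = 0.1):
    """box = (x1, y1, x2, y2) in pixel coordinates; margin pads the crop slightly."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    pad_x = int((x2 - x1) * margin)
    pad_y = int((y2 - y1) * margin)
    # Clamp the padded box to the image bounds before slicing.
    x1, y1 = max(0, x1 - pad_x), max(0, y1 - pad_y)
    x2, y2 = min(w, x2 + pad_x), min(h, y2 + pad_y)
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, output_size)

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # placeholder 1080p frame
patch = crop_gesture_area(frame, (800, 400, 1000, 650))
print(patch.shape)  # (224, 224, 3)
```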
The classification process 320 applies a second classification network 324 to each gesture area 322 (i.e., cropped image 322) to determine the first gesture 305. In some embodiments, the second classification network 324 is one or more neural networks known in the art (e.g., MobileNet, ShuffleNet). In some embodiments, the second classification network 324 is selected based on an intended gesture task (e.g., the intended gestures of a particular application and/or system) and/or based on the number of classified categories (e.g., the different types of gestures that may be classified by a particular application and/or system).
For each cropped image 322, the second classification network 324 determines the corresponding first gesture 305 by determining a first gesture vector for the cropped image 322. Each element of the first gesture vector determined by the second classification network 324 corresponds to a respective gesture and represents a respective first probability or confidence level that the gesture area 322 includes the respective gesture. In some embodiments, the classification process 320 determines the first gesture 305 and the first gesture confidence score 306 based on the first gesture vector. In some embodiments, the first gesture confidence score 306 of the first gesture 305 is combined with the box confidence score 304 determined by the object detection stage 316 to generate a first confidence score for the first gesture 305. The first gesture vector is normalized, meaning that the sum of the probability values corresponding to all gestures is equal to 1.
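The following sketch illustrates, under stated assumptions, how a normalized first gesture vector could be reduced to the first gesture 305, its confidence score 306, and a combined score with the box confidence score 304. The gesture label set, the softmax normalization, and the use of multiplication to combine the two scores are illustrative assumptions; the description above only states that the scores are combined.

```python
# Hypothetical sketch: normalize raw classifier scores into a first gesture
# vector (probabilities summing to 1), pick the first gesture 305 (Kcls) and
# its confidence 306 (Clsi), and combine with the box confidence score 304.
import numpy as np

GESTURES = ["fist", "open_palm", "scissors", "thumbs_up"]  # assumed label set

def classify_crop(raw_scores: np.ndarray, box_conf: float):
    probs = np.exp(raw_scores - raw_scores.max())
    probs /= probs.sum()                      # first gesture vector, sums to 1
    idx = int(probs.argmax())
    first_gesture = GESTURES[idx]             # Kcls
    first_conf = float(probs[idx])            # Clsi
    combined_conf = first_conf * box_conf     # fold in BBox Conf Score (assumed product)
    return first_gesture, first_conf, combined_conf

print(classify_crop(np.array([0.2, 0.1, 2.3, 0.4]), box_conf=0.9))
```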
The information of the first gesture 305 provided by the object detection stage 316 is used to determine whether the first gesture 305 is associated with contextual information. Such contextual information is used to determine whether to select the first gesture 305 or the second gesture 307 to determine the final gesture 340 associated with the image data 312. If the first gesture 305 is not associated with contextual information, the first gesture 305 is used to determine the final gesture 340. Conversely, if the first gesture 305 is associated with contextual information, the second gesture 307 is used to determine the final gesture 340. In an example, the first gesture 305 or the second gesture 307 is used to distinguish between a gesture performed near the user's face (e.g., lifting a finger to the lips to indicate silence) and a gesture performed in open space (e.g., lifting a finger), respectively. Examples of contextual information include, but are not limited to, gestures performed on and/or near a particular portion of the body, gestures performed in a particular environment (e.g., at home, at an office, on a bus, in a library, etc.), previously performed gestures (e.g., a "pick up" gesture performed prior to a "pick up" gesture), and/or movements surrounding the gesture (e.g., pinch-in and/or expand gestures to zoom in and/or out).
The first gesture 305 and the second gesture 307 are determined from the gesture area 322 and the entire image 312, respectively, and are each used to determine the final gesture 340 through the post-processing stage 330. In the post-processing stage 330, the outputs of the detection process 310 and the classification process 320 are used together to determine the final gesture 340. The post-processing stage 330 includes one or more filters applied to the first gesture 305 and the second gesture 307 and the associated confidence scores 304, 306, and 308. The filter function is optionally used to identify false positives using temporal information from previous states. In some embodiments, the filter is represented as a filter function F of time, applied to the gestures and confidence scores of the current and previous frames. In some embodiments, in the post-processing stage 330, a selected one of the first gesture 305 and the second gesture 307 needs to remain stable for at least a predefined number of consecutive frames before being selected as the final gesture 340.
Any number of methods may be used to construct the filter function. In some embodiments, the filter function is used to smooth the output to avoid jitter and provide a smooth user experience. Non-limiting examples of filters and filtering techniques include moving/weighted average smoothing (or convolution), Fourier filtering, Kalman filters, and variations thereof. Although the filtering process is described above with respect to gestures (e.g., Kcls and Kdet), similar filters and filtering techniques may be applied to bounding box smoothing.
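A minimal sketch of one possible stabilization filter follows. It keeps a short history of per-frame gesture decisions and only commits to a final gesture once the same gesture has persisted for a predefined number of consecutive frames, in the spirit of the smoothing filters mentioned above; the window length, frame count, and class name are illustrative assumptions rather than values specified in the embodiments.

```python
# Hypothetical sketch of a temporal filter F: suppress single-frame false
# positives by requiring the same per-frame gesture for N consecutive frames
# before committing to a final gesture 340.
from collections import deque

class GestureStabilizer:
    def __init__(self, min_consecutive_frames: int = 5, window: int = 15):
        self.min_consecutive = min_consecutive_frames
        self.history = deque(maxlen=window)   # recent per-frame decisions
        self.stable_gesture = None            # last committed final gesture

    def update(self, frame_gesture):
        """frame_gesture is the per-frame selection (or None if no gesture)."""
        self.history.append(frame_gesture)
        recent = list(self.history)[-self.min_consecutive:]
        if (len(recent) == self.min_consecutive
                and len(set(recent)) == 1
                and recent[0] is not None):
            self.stable_gesture = recent[0]   # gesture held long enough
        return self.stable_gesture

stabilizer = GestureStabilizer(min_consecutive_frames=3)
for g in ["scissors", "scissors", None, "scissors", "scissors", "scissors"]:
    print(stabilizer.update(g))   # commits to "scissors" only on the last frame
```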
Further details of determining the final gesture 340 based on the first gesture 305 and the second gesture 307 and the associated confidence scores 304, 306, and 308 are discussed with reference to fig. 4. Note that fig. 3 illustrates an example process 300 of determining gestures, and that gesture detection and classification process 300 may be similarly applied to detect one or more of facial gestures, arm gestures, body gestures, and/or other sortable gestures performed by a user.
FIG. 4 is a flowchart of an example post-processing technique 400 for determining the final gesture 340 from the first gesture 305 and the second gesture 307 determined from the image data 312, according to some embodiments. The post-processing technique 400 is an embodiment of one or more processes performed by the post-processing stage 330 described above with reference to fig. 3. The post-processing technique 400 shows two branches for determining the final gesture 340: the left branch is based on the second gesture 307 (Kdet), the second gesture confidence score 308 (DetClsi) associated with the second gesture 307, and the box confidence score 304 (Pbox) associated with the bounding box 303 of the first gesture 305, and the right branch is based on the first gesture 305 (Kcls) and the first gesture confidence score 306 (Clsi), as described above with reference to fig. 3.
Beginning at operation 410, the post-processing technique 400 obtains the first gesture 305 (Kcls) and the first gesture confidence score 306 (Clsi) using the object detection stage 316 and the classification process 320. At operation 420, the post-processing technique 400 determines whether the first gesture confidence score 306 (Clsi) of the first gesture 305 (Kcls) is greater than or equal to a second threshold probability (P2). In some embodiments, the second threshold probability P2 is at least 0.10. In some embodiments, the second threshold probability P2 is any probability (e.g., at least 0.15) defined by the user and/or the system implementing the gesture detection and classification process 300. In some embodiments, the second threshold probability P2 is adjusted to obtain optimal performance, and its value is highly dependent on the accuracy of the detection process 310 (e.g., including the gesture classification stage 318).
If the first gesture confidence score 306 (Clsi) is below the second threshold probability P2 ("no" at operation 420), then the corresponding first gesture 305 (Kcls) is determined to be an invalid gesture for the image (i.e., no gesture is detected 480). Conversely, if the first gesture confidence score 306 (Clsi) is greater than or equal to the second threshold probability P2 ("yes" at operation 420), then the corresponding first gesture 305 (Kcls) remains a candidate for the final gesture 340 and is used at operation 430. At operation 430, the post-processing technique 400 determines whether the first gesture 305 (Kcls) is contextually relevant (i.e., whether the first gesture 305 is associated with contextual information). If the first gesture 305 (Kcls) is not contextually relevant ("no" at operation 430), then the post-processing technique 400 determines that the first gesture 305 (Kcls) is the gesture category 490 (i.e., the final gesture 340). Alternatively, if the first gesture 305 (Kcls) is contextually relevant ("yes" at operation 430), the post-processing technique 400 proceeds to operation 460 and uses the first gesture 305 (Kcls) in conjunction with the second gesture 307 to determine the final gesture 340 (where the first gesture 305 is focused on the gesture area 322 and the second gesture 307 is based on the entire image).
Turning to operation 440, the post-processing technique 400 obtains the second gesture 307 (Kdet), the second gesture confidence score 308 (DetClsi) associated with the second gesture 307, and the box confidence score 304 (Pbox). At operation 450, it is determined whether the second gesture confidence score 308 (DetClsi) of the second gesture 307 is greater than or equal to the first threshold probability (P1). In some embodiments, the first threshold probability P1 is at least 0.10. In some embodiments, the first threshold probability P1 is any probability (e.g., at least 0.15) defined by the user and/or the system implementing the gesture detection and classification process 300. In some embodiments, the first threshold probability P1 is adjusted to obtain optimal performance and its value is highly dependent on the accuracy of the detection process 310 and the classification process 320.
If the second gesture confidence score 308 (DetClsi) of the second gesture 307 (Kdet) is below the first threshold probability P1 ("no" at operation 450), then the second gesture 307 (Kdet) is determined to be an invalid gesture for the image (i.e., no gesture is detected 480). Conversely, if the second gesture confidence score 308 (DetClsi) of the second gesture 307 (Kdet) is greater than or equal to the first threshold probability P1 ("yes" at operation 450), then the second gesture 307 (Kdet) remains a candidate for the final gesture 340 and is used at operation 460.
At operation 460, the post-processing technique 400 determines whether the second gesture 307 (Kdet) is the same as the first gesture 305 (Kcls). If the second gesture 307 (Kdet) and the first gesture 305 (Kcls) are different ("no" at operation 460), then the post-processing technique 400 determines that the potential gesture is invalid (i.e., no gesture is detected 480). Alternatively, if the second gesture 307 (Kdet) and the first gesture 305 (Kcls) are the same ("yes" at operation 460), the post-processing technique 400 proceeds to operation 470 and determines whether a third confidence score 402 is greater than a third threshold probability (P3). The third confidence score 402 is equal to the second gesture confidence score 308 (DetClsi) of the second gesture 307 (Kdet) multiplied by the box confidence score 304 (Pbox). In fig. 4, the third confidence score 402 is represented by DetCls_Kdet × Pbox. In some embodiments, the third threshold probability P3 is at least 0.10. In some embodiments, the third threshold probability P3 is any probability (e.g., at least 0.15) defined by the user and/or the system implementing the gesture detection and classification process 300. In some embodiments, the third threshold probability P3 is adjusted for optimal performance, and its value is highly dependent on the accuracy of the detection process 310 and the classification process 320.
If the third confidence score 402 is less than the third threshold probability P3 ("no" at operation 470), then the second gesture 307 (Kdet) is determined to be invalid (i.e., no gesture is detected 480). If the third confidence score 402 is greater than the third threshold probability P3 ("yes" at operation 470), the post-processing technique 400 determines that the second gesture 307 (Kdet) is the gesture category 490 (i.e., the final gesture 340). By combining the information provided by the detection process 310 and the classification process 320, it may be determined whether the gesture is contextually relevant (rather than relying solely on the classification process). This holistic approach improves the ability of the electronic device to detect and recognize gestures as compared to existing solutions.
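To summarize the branching logic of fig. 4, the sketch below walks through operations 420 to 470 for a single frame. The threshold values, the contextual gesture set, and the function and variable names are assumptions chosen for illustration; only the ordering and comparisons of the operations follow the description above.

```python
# Hypothetical sketch of the FIG. 4 post-processing technique 400.
# Returns the final gesture 340, or None when no gesture 480 is detected.
P1, P2, P3 = 0.10, 0.10, 0.10                  # example threshold probabilities
CONTEXTUAL_GESTURES = {"finger_to_lips"}       # assumed context-dependent set

def select_final_gesture(k_cls, cls_conf, k_det, det_cls_conf, box_conf):
    # Operation 420: discard weak crop-based classifications.
    if cls_conf < P2:
        return None                            # no gesture detected (480)
    # Operation 430: context-free gestures are accepted directly (490).
    if k_cls not in CONTEXTUAL_GESTURES:
        return k_cls
    # Operation 450: the whole-image branch must also be confident.
    if det_cls_conf < P1:
        return None
    # Operation 460: both branches must agree on the gesture.
    if k_det != k_cls:
        return None
    # Operation 470: combined score DetCls_Kdet * Pbox must exceed P3.
    if det_cls_conf * box_conf > P3:
        return k_det                           # final gesture 340
    return None

print(select_final_gesture("finger_to_lips", 0.8, "finger_to_lips", 0.7, 0.9))
```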
FIG. 5 is a flow diagram of a method 500 of classifying one or more gestures, according to some embodiments. The method 500 includes one or more of the operations described above with reference to fig. 3 and 4. The method 500 provides a solution for gesture recognition (e.g., hand gestures, facial actions, arm gestures, etc.) across different electronic devices and/or systems (e.g., as described above with reference to fig. 1 and 2). The gesture determination method 500 improves the accuracy of both local gesture classification (e.g., context-independent gesture classification) and contextual gesture classification (e.g., context-based gesture classification) relative to existing solutions. For example, in some embodiments, the gesture determination process 500 demonstrates an increase in the accuracy of local gesture classification of at least 4-10% compared to existing solutions, and an increase in the accuracy of contextual gesture classification of at least 43-120% compared to existing solutions.
The operations (e.g., steps) of method 500 are performed by one or more processors (e.g., CPU 202; fig. 2) of an electronic device (e.g., at server 102 and/or client device 104). At least some of the operations shown in fig. 5 correspond to instructions stored in a computer memory or computer-readable storage medium (e.g., memory 206; fig. 2). Operations 502-512 may also be performed in part using one or more processors and/or using instructions stored in a memory or computer-readable medium of one or more devices communicatively coupled together, such as a notebook computer, AR glasses, or other head mounted display, server, tablet, security camera, drone, smart television, smart speaker, toy, smart watch, smart appliance, or other computing device that may perform operations 502-512 alone or in combination with a corresponding processor of communicatively coupled electronic device 200.
The method 500 includes obtaining (502) an image 312 including a hand region 322, detecting (504) the hand region 322 in the image 312, and determining (506) a first gesture 305 from the hand region of the image. For example, as described above with reference to FIG. 3, the image data 312 is obtained and (after the image data 312 is processed by the first detection and classification model 314) a detection process 310 (which may be used to crop images) and a classification process 320 are applied to the image data 312 to determine a classification gesture vector for a cropped image 322 (i.e., the hand region 322) that is fused with contextual information.
In some embodiments, determining the first gesture 305 from the hand region 322 of the image 312 includes generating a first gesture vector from the hand region 322 of the image, each element of the first gesture vector corresponding to a respective gesture and representing a respective first confidence level that the hand region 322 includes the respective gesture, and determining the first gesture 305 and the first gesture confidence score 306 from the first gesture vector. For example, as described above with reference to FIG. 3, the object detection stage 316 is applied to detect one or more gestures in the image data 312 (after passing through the first detection and classification model 314), which are cropped and used to determine the first gesture vector. In some embodiments, the method 500 further includes associating the detection of the hand region 322 in the image 312 with the bounding box confidence score 304, and combining the bounding box confidence score 304 with the first gesture confidence score 306 (Clsi) of the first gesture 305. For example, as described above with reference to fig. 3, the output of the classification process 320 may be combined with the output of the object detection stage 316 (e.g., the bounding box confidence score).
In some embodiments, the first gesture 305 includes the respective gesture corresponding to the maximum first confidence level among the respective first confidence levels of the elements of the first gesture vector, and the first gesture confidence score 306 is equal to that maximum first confidence level. In other words, the first gesture 305 may have a greater confidence score (e.g., the first confidence score 306) than the other gestures in a corresponding set of one or more gestures (e.g., the first gesture 305 described above with reference to fig. 3).
The method 500 includes determining (508) a second gesture 307 from the image (e.g., the entire image). In some embodiments, determining the second gesture 307 from the image includes generating a second gesture vector from the image (e.g., the entire image), each element of the second gesture vector corresponding to a respective gesture and representing a respective second confidence level that the image includes the respective gesture, and determining the second gesture 307 and the second gesture confidence score 308 from the second gesture vector.
In some embodiments, the second gesture 307 comprises a respective gesture corresponding to a maximum second confidence level of the respective second confidence levels of each element of the second gesture vector, and the second gesture confidence score 308 is equal to the maximum second confidence level of the respective second confidence levels of each element of the second gesture vector. In other words, the second gesture 307 may have a greater confidence score (e.g., the second confidence score 308) than other gestures in a corresponding set of one or more gestures (e.g., the second set of one or more gestures described above with reference to fig. 3).
The method 500 includes determining (510) that the final gesture of the image is the first gesture 305 in accordance with a determination that the first gesture 305 is not any gesture of a plurality of contextual gestures. For example, as shown above with reference to fig. 4, in accordance with a determination that the gesture Kcls (e.g., the first gesture 305) is not contextually relevant ("no" at operation 430), the gesture Kcls is determined to be the final gesture 340. In some embodiments, the method 500 includes, prior to determining whether the first gesture 305 is at least one gesture of the plurality of contextual gestures, determining whether the first gesture confidence score 306 is greater than the second threshold P2, and, in accordance with a determination that the first gesture confidence score 306 is less than the second threshold P2, determining that the image is not associated with any gesture. For example, as described above with reference to fig. 4, in accordance with a determination that the confidence score of the gesture Kcls (e.g., Cls_Kcls) is less than the second threshold probability P2 ("no" at operation 420), the method 500 determines that no gesture is present.
In some embodiments, the method 500 includes, prior to determining whether the first gesture 305 is at least one gesture of the plurality of contextual gestures, determining whether the second gesture confidence score 308 (DetClsi) of the second gesture 307 is greater than the first threshold P1, and, in accordance with a determination that the second confidence score 308 (DetClsi) of the second gesture 307 is less than the first threshold P1, determining that the image is not associated with any gesture. For example, as described above with reference to fig. 4, in accordance with a determination that the confidence score of the gesture Kdet (e.g., DetCls_Kdet) is less than the first threshold probability P1 ("no" at operation 450), the method 500 determines that no gesture is present.
The method 500 further includes, in accordance with a determination that the first gesture 305 is one of the plurality of contextual gestures, determining (512) the final gesture based on the second gesture 307 and the second gesture confidence score 308, the second gesture 307 and the second gesture confidence score 308 being associated with the image (e.g., the entire image). In some embodiments, determining the final gesture 340 based on the second gesture 307 and the second confidence score 308 further includes determining that the image 312 is not associated with any gesture in accordance with a determination that the first gesture 305 and the second gesture 307 are different from each other. For example, as described above with reference to fig. 4, in accordance with a determination that the gesture Kdet is different from the gesture Kcls ("no" at operation 460), the method 500 determines that no gesture is present.
Alternatively, in some embodiments, in accordance with a determination that the first gesture 305 and the second gesture 307 are the same as each other and in accordance with a determination that the third confidence score 402 does not exceed the integrated confidence threshold (e.g., P3 in fig. 4), the method 500 includes determining that the image is not associated with any gesture. For example, as described above with reference to fig. 4, in accordance with a determination that the third confidence score 402 (e.g., DetCls_Kdet × Pbox, the confidence score of the gesture Kdet multiplied by the box confidence score 304 (Pbox) output by the object detection stage 316; fig. 3) is less than the third threshold probability P3 ("no" at operation 470), the method 500 determines that no gesture is present. In some embodiments, in accordance with a determination that the first gesture 305 and the second gesture 307 are the same as each other and in accordance with a determination that the third confidence score 402 exceeds the integrated confidence threshold P3, the method 500 includes determining that the final gesture 340 is the second gesture 307. For example, as described above with reference to fig. 4, in accordance with a determination that the third confidence score 402 is greater than or equal to the third threshold probability P3 ("yes" at operation 470), the method 500 determines the second gesture (Kdet) as the final gesture 340.
In some embodiments, the first threshold P1, the second threshold P2, and the integrated confidence threshold P3 are adjusted to obtain optimal performance and are highly dependent on the accuracy of the processes 310 and 320.
In some embodiments, the method 500 further includes filtering the final gesture using a filtering function. The filtering function is used to identify false positives. In some embodiments, the filtering function is one of a convolution function (or a moving and/or weighted average smoothing function), a Fourier filter function, or a Kalman filter. In some embodiments, the filtering function is a function of time that requires the determination of the first gesture and/or the second gesture to be stable (e.g., stable for at least 5 consecutive frames). In some embodiments, the filtering function helps smooth the bounding boxes and the identified categories, helps avoid jitter and loss of detection in the process, and makes downstream features easier to engineer (e.g., implementing gesture control for volume adjustment). Additional information about the filter is provided with reference to fig. 3 above.
It should be understood that the particular order of operations described in fig. 5 is merely exemplary and is not intended to indicate that the described order is the only order in which the operations may be performed. Those of ordinary skill in the art will recognize various ways of detecting and classifying gestures as described herein. Additionally, it should be noted that the details of the other processes described above with respect to fig. 3 and 4 also apply in a similar manner to the method 500 described above with respect to fig. 5. For brevity, these details are not repeated here.
The terminology used in the description of the various embodiments described herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various embodiments described and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. In addition, it will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
As used herein, the term "if" is optionally interpreted to mean "when..once..once more" or "when..once more" or "in response to a determination" or "in response to a detection" or "in accordance with a determination of..once more" depending on the context. Similarly, the phrase "if a determination" or "if a [ stated condition or event ] is detected" is optionally interpreted as meaning "at the time of determination of..once..once the term" or "in response to a determination" or "when a [ stated condition or event ] is detected" or "in response to a detection of a [ stated condition or event ]" or "in accordance with a determination of a [ stated condition or event ] is detected", depending on the context.
The foregoing description, for purposes of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of operation and their practical applications, thereby enabling others skilled in the art to practice them.
Although the various figures show a plurality of logic stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, other orderings and groupings will be apparent to those of ordinary skill in the art, and thus the orderings and groupings presented herein are not an exhaustive list of alternatives. Furthermore, it should be appreciated that these stages may be implemented in hardware, firmware, software, or any combination thereof.