HK1174707A - Gesture detection and recognition
Description
Technical Field
The invention relates to gesture detection and recognition.
Background
Gesture-based interaction techniques provide an intuitive and natural way for users to interact with computing devices. Many devices and systems provide users with the ability to interact using simple, easily detected gestures, such as a pinch or slide on a touch-sensitive screen. Such gesture-based interaction can greatly enhance the user experience.
However, to support richer or more different gestures, the computational complexity of accurately detecting and recognizing gestures can increase significantly. For example, as the number and/or complexity of gestures increases, the computational complexity involved in detecting gestures may result in noticeable lag between the gesture performed and the action taken by the computing device. In the case of some applications (such as gaming), this lag may negatively impact the user experience.
Furthermore, as the use of gesture-based user interactions becomes more common, more and more diverse users interact in this manner. For example, users performing gestures come from a wider age range and have different levels of experience. This means that different users may perform the same gesture very differently, thereby presenting challenges to the gesture recognition technology to produce consistent and accurate detection.
Furthermore, the use of natural user interfaces is becoming more widespread, where users interact more intuitively with computing devices using, for example, camera-based input or devices for tracking movements of user body parts. Such a natural user interface makes the input of gestures more "free" (i.e., less constrained) than gestures performed, for example, on touch-sensitive screens. This brings more degrees of freedom and variability of gestures, further increasing the need for gesture recognition techniques.
The embodiments described below are not limited to implementations that address any or all of the disadvantages of known gesture recognition techniques.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding to the reader. This summary is not an extensive overview of the invention, and it does not identify key/critical elements of the invention or delineate the scope of the invention. Its sole purpose is to present some concepts disclosed herein in a simplified form as a prelude to the more detailed description that is presented later.
Gesture detection and recognition techniques are described. In one example, a sequence of data items related to a motion of a user performing a gesture is received. A set of data items selected from the sequence is tested against a pre-learned threshold value to determine the probability that the sequence represents a gesture. If the probability is greater than a predetermined value, the gesture is detected and an action is performed. In an example, these tests are performed by a trained decision tree classifier. In another example, a sequence of data items may be compared to a pre-learned template and a degree of similarity determined between them. If the similarity of the template exceeds a threshold, a likelihood value associated with a future time of a gesture associated with the template is updated. Then, when the future time is reached, the gesture is detected if the likelihood value is greater than a predefined value.
Many of the attendant features will be more readily appreciated as the same becomes better understood by reference to the following detailed description considered in connection with the accompanying drawings.
Drawings
The invention will be better understood from the following detailed description when read in light of the accompanying drawings, in which:
FIG. 1 illustrates an exemplary camera-based control system for controlling a computer game;
FIG. 2 shows a schematic diagram of an image capture device;
FIG. 3 shows a schematic diagram of a mobile device;
FIG. 4 illustrates a flow diagram of a process for training a decision forest to recognize gestures;
FIG. 5 illustrates an example portion of a random decision forest;
FIG. 6 illustrates an example threshold test for a data sequence;
FIG. 7 illustrates an example gesture recognition calculation using a decision tree including the test of FIG. 6;
FIG. 8 shows a flow diagram of a process for recognizing gestures using a trained decision forest;
FIG. 9 shows a flowchart of a process for recognizing gestures using a trained logistic model;
FIG. 10 illustrates an example gesture recognition calculation using a trained logistic model; and
FIG. 11 illustrates an exemplary computing-based device in which embodiments of gesture recognition techniques may be implemented.
Like reference numerals are used to refer to like parts throughout the various drawings.
Detailed Description
The detailed description provided below in connection with the appended drawings is intended as a description of the present examples and is not intended to represent the only forms in which the present examples may be constructed or utilized. The description sets forth the functions of the example of the invention, as well as the sequence of steps for constructing and operating the example of the invention. However, the same or equivalent functions and sequences may be accomplished by different examples.
Although the present examples are described and illustrated herein as being implemented in a gaming system or mobile communication device, they are provided by way of example only and not limitation. Those skilled in the art will appreciate that the present examples are suitable for application in a variety of different types of computing systems.
A gesture recognition technique is described herein that allows user gestures to be detected and recognized in a computationally efficient manner with low latency. These gesture recognition techniques may be applied to a natural user interface. For example, fig. 1 and 2 show examples in which a computing device (such as a gaming device) can be controlled with user gestures captured by a camera, while fig. 3 shows an example in which a handheld device can be controlled with user gestures detected by motion and/or orientation sensors.
Referring initially to FIG. 1, FIG. 1 illustrates an exemplary camera-based control system 100 for controlling a computer game. In this illustrated example, FIG. 1 shows a user 102 playing a boxing game. In some examples, the camera-based control system 100 may be used to, among other things, determine body pose, bind, recognize, analyze, and track a human target such as the user 102, provide feedback, and/or adapt to aspects of the human target.
The camera-based control system 100 includes a computing device 104. The computing device 104 may be a general purpose computer, a gaming system or console, or a dedicated image processing device. The computing device 104 may include hardware components and/or software components such that the computing device 104 may be used to execute applications such as gaming applications and/or non-gaming applications. The structure of computing device 104 is discussed later with reference to FIG. 11.
The camera-based control system 100 further includes a capture device 106. The capture device 106 may be, for example, an image sensor or detector that may be used to visually monitor one or more users (such as the user 102) such that gestures performed by the one or more users may be captured, analyzed, processed, and tracked to perform one or more controls or actions within a game or application, as will be described in more detail below.
The camera-based control system 100 may further include a display device 108 connected to the computing device 104. The display device 108 may be a television, monitor, High Definition Television (HDTV), or the like that may provide game or application visuals (and optionally audio) to the user 102.
In operation, the user 102 may be tracked using the capture device 106 such that the position, movement, and size of the user 102 may be interpreted by the computing device 104 (and/or the capture device 106) as controls that may be used to affect an application executed by the computing device 104. As a result, the user 102 is able to move his or her body to control the executed game or application.
In the illustrative example of fig. 1, the application executing on the computing device 104 is a boxing game that the user 102 is playing. In this example, the computing device 104 controls the display device 108 to provide a visual representation of a boxing opponent to the user 102. The computing device 104 also controls the display device 108 to provide a visual representation of a user avatar that the user 102 may control with his or her movements. For example, the computing device 104 may include a body pose estimator arranged to recognize and track different body parts of the user and map these parts onto the avatar. In this manner, the avatar replicates the movements of the user 102 such that if the user 102, for example, punches in physical space, this would cause the user avatar to punch in game space.
However, merely replicating user movement in the game space limits the type and complexity of interaction between the user and the game. For example, many in-game controls are momentary actions or commands that, in conventional gaming systems, may be triggered using button presses. Examples of such actions or commands include, for example, punching a fist, shooting, changing weapons, throwing, kicking, jumping, and/or squatting. By recognizing that the user is performing one of these actions and triggering the corresponding in-game action, rather than merely replicating the user's movements, these actions or commands may be more reliably controlled in many applications.
Control of a computing device, such as a gaming system, also includes input to control many actions other than an avatar. For example, commands are used to control menu selection, back/exit, turn the system on or off, pause, save games, communicate with friends, and so forth. Further, controls are used to interact with applications other than games, for example to enter text, select icons or menu items, control media playback, browse websites, or operate any other controllable aspect of an operating system or application.
These commands and actions cannot be controlled by merely mirroring the user's movements. Rather, higher-level processing analyzes these movements to detect whether the user is performing a gesture corresponding to one of these commands or actions. If a gesture is recognized, the corresponding action or command may be performed. However, this imposes requirements on the gesture recognition system in terms of the speed and accuracy of gesture detection. If there is lag or latency, the usability of the computing device is affected. Similarly, if gestures are detected inaccurately, this can also negatively impact the user experience. The following describes gesture recognition techniques that allow gestures to be detected and recognized quickly and accurately.
Referring now to FIG. 2, FIG. 2 shows a schematic diagram of a capture device 106 that may be used in the camera-based control system 100 of FIG. 1. In the example of FIG. 2, the capture device 106 is configured to capture video images with depth information. Such a capture device may be referred to as a depth camera. The depth information may be in the form of a depth image comprising depth values, i.e., a value associated with each image element of the depth image that is related to the distance between the depth camera and an item or object located at that image element. Note that the term "image element" is used to refer to a pixel, group of pixels, voxel, or other higher-level component of an image.
The depth information may be obtained using any suitable technique, including, for example, time-of-flight, structured light, stereo images, and the like. In some examples, the capture device 106 may organize the depth information into "Z layers," or layers that may be perpendicular to a Z axis extending from the depth camera along its line of sight.
As shown in FIG. 2, the capture device 106 includes at least one imaging sensor 200. In the example shown in fig. 2, the imaging sensor 200 includes a depth camera 202 arranged to capture a depth image of a scene. The captured depth image may include a two-dimensional (2-D) area of the captured scene where each image element in the 2-D area represents a depth value, such as a length or distance of an object in the captured scene from depth camera 202.
The capture device may also include an emitter 204 arranged to illuminate the scene in such a manner that depth information may be ascertained by the depth camera 202. For example, where the depth camera 202 is an Infrared (IR) time-of-flight camera, the emitter 204 emits IR light onto the scene, and the depth camera 202 is arranged to detect backscattered light from the surface of one or more targets and objects in the scene. In some examples, pulsed infrared light may be emitted from the emitter 204 such that the time between an outgoing light pulse and a corresponding incoming light pulse may be detected and measured by the depth camera and used to determine a physical distance from the capture device 106 to a location on a target or object in the scene. Additionally, in some examples, the phase of the outgoing light wave from the emitter 204 may be compared to the phase of the incoming light wave at the depth camera 202 to determine a phase shift. The phase shift may then be used to determine a physical distance from the capture device 106 to a location on the targets or objects. In further examples, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 106 to a location on the targets or objects by analyzing the intensity of the reflected beam of light over time via various techniques including, for example, shuttered light pulse imaging.
In another example, the capture device 106 may use structured light to capture depth information. In such techniques, patterned light (e.g., light displayed as a known pattern such as a spot, grid, or stripe pattern, which may also vary over time) may be projected onto the scene using the emitter 204. Upon striking the surface of one or more targets or objects in the scene, the pattern becomes deformed. Such pattern deformation may be captured by the depth camera 202 and then analyzed to determine a physical distance from the capture device 106 to a location on a target or object in the scene.
In another example, depth camera 202 may be in the form of two or more physically separated cameras viewing a scene from different angles in order to obtain visual stereo data that can be resolved to generate depth information. In this case, the emitter 204 may be used to illuminate the scene, or may be omitted.
In some examples, in addition to or instead of depth camera 202, capture device 106 may include a conventional video camera known as an RGB camera 206. The RGB camera 206 is arranged to capture a sequence of images of a scene at visible light frequencies and can therefore provide images that can be used to augment the depth image. In some examples, an RGB camera 206 may be used instead of the depth camera 202. The capture device 106 may also optionally include a microphone 207 or microphone array (which may be directional and/or steerable) arranged to capture sound information (such as speech input from the user) and may be used for speech recognition.
The capture device 106 shown in fig. 2 further includes at least one processor 208, the processor 208 in communication with the image sensor 200 (i.e., the depth camera 202 and the RGB camera 206 in the example of fig. 2), the emitter 204, and the microphone 207. The processor 208 may be a general purpose microprocessor, or a dedicated signal/image processor. The processor 208 is arranged to execute instructions to control the imaging sensor 200, the emitter 204, and the microphone 207 to capture depth images, RGB images, and/or voice signals. The processor 208 may also optionally be arranged to perform processing on these images and signals, as outlined in more detail below.
The capture device 106 shown in fig. 2 further includes a memory 210, the memory 210 being arranged to store instructions executed by the processor 208, images or frames of images captured by the depth camera 202 or the RGB camera 206, or any other suitable information, images, or the like. In some examples, memory 210 may include Random Access Memory (RAM), Read Only Memory (ROM), cache, flash memory, a hard disk, or any other suitable storage component. The memory 210 may be a separate component in communication with the processor 208 or may be integrated into the processor 208.
The capture device 106 also includes an output interface 212 in communication with the processor 208 and arranged to provide data to the computing device 104 via a communication link. The communication link may be, for example, a wired connection (such as a USB, firewire, ethernet or similar connection) and/or a wireless connection (such as a WiFi, bluetooth or similar connection). In other examples, the output interface 212 may interface with one or more communication networks (such as the internet) and provide data to the computing device 104 via these networks.
The computing device 104 performs a number of functions related to camera-based gesture recognition, such as an optional body pose estimator 214 and a gesture recognition engine 216, as described in more detail below. The body pose estimator 214 is arranged to use computer vision techniques to detect and track different body parts of the user. The body pose estimator 214 may provide an output to the gesture recognition engine in the form of a time series of data items related to the user's body pose. This may be in the form of a fully tracked skeletal model of the user, or a coarser identification of the user's visible body parts. For example, the time series may include data relating to a time-varying angle between at least two body parts of the user, a rate of change of that angle, a speed of movement of at least one body part of the user, or a combination thereof. The different types of data (angles between certain body parts, velocities, etc.) are referred to as "features". In other examples, the body pose estimator 214 may derive other sequences (i.e., other features) from the user's pose as it changes over time. In further examples, the gesture recognition engine 216 may derive its input (i.e., features) from a source other than the body pose estimator. Application software 218 may also be executed on the computing device 104 and controlled using the gestures.
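As an illustration of the kind of feature time series the body pose estimator might supply, the following sketch computes a joint angle, its rate of change, and a body-part speed from two consecutive frames of 3-D joint positions. The joint names and frame layout are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (radians) at joint b formed by 3-D joint positions a, b, c."""
    v1, v2 = a - b, c - b
    cos_ang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos_ang, -1.0, 1.0))

def frame_features(prev_frame, frame, dt):
    """Build one data item (a set of features) from two consecutive pose frames.

    Each frame is assumed to be a dict mapping a joint name to its 3-D position.
    """
    angle = joint_angle(frame["shoulder"], frame["elbow"], frame["wrist"])
    prev_angle = joint_angle(prev_frame["shoulder"], prev_frame["elbow"], prev_frame["wrist"])
    return {
        "elbow_angle": angle,                          # angle between body parts
        "elbow_angle_rate": (angle - prev_angle) / dt, # rate of change of the angle
        "hand_speed": np.linalg.norm(frame["wrist"] - prev_frame["wrist"]) / dt,  # body-part speed
    }
```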
Referring now to FIG. 3, FIG. 3 illustrates an example handheld or mobile device 300 that may be controlled by gestures. In one example, the mobile device of fig. 3 may be a mobile phone or other mobile computing or communication device. Interaction with such mobile devices involves the use of commands such as, for example, navigating to contact details of an entity, launching an application, calling a person, placing the device in a different mode of operation (mute, vibrate, out of home, etc.), answering a call, and many other commands. The gesture recognition techniques described herein allow users to make these commands through motion-based gestures. In other words, the user can control the mobile device 300 by moving the mobile device in some manner.
The mobile device 300 includes one or more sensors that provide information related to the motion, orientation, and/or position of the mobile device 300. In the example of FIG. 3, the mobile device 300 includes an accelerometer 302, a gyroscope 304, an Inertial Measurement Unit (IMU) 306, and a compass 308, where the accelerometer 302 measures the proper acceleration of the device in one or more axes, the gyroscope 304 can determine the orientation of the mobile device, the IMU 306 can measure both acceleration and orientation, and the compass 308 can measure the heading of the mobile device. In other examples, the mobile device 300 can include any combination of one or more of these sensors.
The sensors provide information to the computing device 104 in the form of a sequence of data items relating to the motion or orientation of the mobile device over time. Note that the computing device 104 may be integrated into the mobile device 300, or in other examples, may be located at a remote location. The computing device 104 executes a gesture recognition engine 216, the engine 216 being arranged to interpret information about the motion and/or orientation (i.e., features) of the mobile device and to recognize gestures made by the user. Commands from the gesture recognition engine 216 control application software 218 executing on the computing device 104, such as those mentioned above. The mobile device 300 may also include a display device 310, such as a screen for displaying information to a user, and one or more input devices 312, such as touch sensors or buttons.
Two techniques for detecting and recognizing gestures that may be applied to a natural user interface in scenarios such as those described above are described below. Note that these gesture recognition techniques may also be applied in many other scenarios, in addition to the camera-based and motion-based examples described above. The first technique described with reference to fig. 4 to 8 is based on the use of machine learning classifiers. The second technique described with reference to fig. 9 and 10 is based on a trained logistic model.
As mentioned, a first gesture recognition technique described herein utilizes a machine learning classifier to classify gestures and act accordingly. A machine learning classifier as used herein is a random decision forest. However, in other examples, alternative classifiers, such as support vector machines, boosting, may also be used. In a further example, rather than using a decision forest, a single trained decision tree may be used (this is equivalent to a forest in the following explanation with only one tree). In the following description, the process of training a decision forest for a machine learning algorithm is discussed first with reference to fig. 4 to 6, and the process of using the trained decision tree to classify (recognize) a gesture using the trained algorithm is discussed second with reference to fig. 7 and 8.
The decision tree is trained using a set of annotated training sequences. The annotated training sequence comprises a sequence of data items corresponding to those seen during operation of the gesture recognition technique. However, the training sequence is annotated to classify each data item.
The sequence of data items may describe various different features that may be interpreted by gesture recognition techniques. For example, such features include, but are not limited to:
● angles between two or more body parts derived from the body pose estimator;
● rate of change of angle between two or more body parts derived from the body pose estimator;
● velocity of one or more body parts tracked using the body pose estimator;
● inertia, acceleration, or orientation of the mobile device;
● a speech signal from a microphone or speech recognizer;
● raw depth image features (i.e., not from the body pose estimator), such as optical flow in depth and/or velocity of tracked feature points;
● raw RGB image characteristics, such as statistics of optical flow in the RGB image;
● based on features of the body pose estimator output combined with the original depth image, such as the time derivative of the body part probability; or
● any combination of these features.
Each gesture has one or more points in time at which the command or action to which the gesture relates is triggered. This is referred to as the "action point" of the gesture, and represents the end of the gesture motion (or an identifiable point during the gesture motion), e.g., the apex of a punch gesture. The data item at the action point of a gesture, together with its temporal history, is classified as belonging to that gesture, and all other data items are classified as "background". The set of training sequences may include sequences relating to a plurality of different gestures and may include data sequences relating to different measurements or combinations of measurements (e.g., angles, velocities, accelerations, etc.). In some examples, the training sequences may be perturbed by randomly time-warping them or by adapting the features (e.g., retargeting skeleton-based features from the body pose estimator to skeletons of different sizes).
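As a minimal sketch of how this labeling might be applied, the following assumes each annotation supplies the frame index of an action point; all other items default to the background class (function and names are illustrative):

```python
def label_training_sequence(num_items, action_points):
    """Assign a class label to every data item in a training sequence.

    action_points: list of (frame_index, gesture_name) pairs, one per annotated
    action point. Every other data item is labeled "background".
    """
    labels = ["background"] * num_items
    for frame_index, gesture in action_points:
        labels[frame_index] = gesture
    return labels

# Example: a 100-item sequence with a punch action point at item 42.
labels = label_training_sequence(100, [(42, "punch")])
```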
FIG. 4 illustrates a flow diagram of a process for training a decision forest to recognize gestures. First, the set of annotated training sequences is received 402 and the number of decision trees to be used in the random decision forest is selected 404. A random decision forest is a collection of deterministic decision trees. Decision trees may be used in classification algorithms, but may suffer from overfitting, which results in poor generalization. However, an ensemble of many randomly trained decision trees (a random forest) yields improved generalization. The number of trees is fixed during the training process. In one example, the number of trees is ten, although other values may be used.
The following notation is used to describe the training process for gesture recognition. The forest is composed of T trees denoted Ψ1, ..., Ψt, ..., ΨT, where t indexes each tree. FIG. 5 illustrates an example random decision forest. The illustrative decision forest of FIG. 5 includes three decision trees: a first tree 500 (denoted tree Ψ1); a second tree 502 (denoted tree Ψ2); and a third tree 504 (denoted tree Ψ3). Each decision tree includes a root node (e.g., root node 506 of the first decision tree 500), a plurality of internal nodes referred to as split nodes (e.g., split node 508 of the first decision tree 500), and a plurality of leaf nodes (e.g., leaf node 510 of the first decision tree 500).
In operation, each root and split node of each tree performs a binary test on the input data and, based on the result, directs the data to the left or right child node. The leaf nodes do not perform any action; they store only probability distributions (e.g., the example probability distribution 512 for a leaf node of the first decision tree 500 of FIG. 5), as described below.
The manner in which the parameters used by each split node are selected, and how the leaf node probabilities are computed, is now described with reference to the remainder of FIG. 4. A decision tree (e.g., the first decision tree 500) is selected 406 from the decision forest, and the root node 506 is selected 408. All annotated sequences from the training set are then selected. Each data item x in a training sequence (together with its associated temporal history) is associated with a known class label, denoted y(x). Thus, for example, y(x) indicates whether data item x relates to the action point of a gesture class such as punch, kick, jump, shoot, call, select, answer, or exit, or to the background class, where the background class label indicates that data item x does not relate to a defined gesture action point.
A random set of test parameters is then generated 410 for use in the binary test performed at the root node 506. In one example, the binary test is of the form: ξ > f(x; θ) > τ, such that f(x; θ) is a function applied to data item x in the sequence, the function having a parameter θ, and the output of the function being compared to threshold values ξ and τ. The result of the binary test is true if the result of f(x; θ) is in the range between ξ and τ. Otherwise, the result of the binary test is false. In other examples, only one of the threshold values ξ and τ may be used, such that the result of the binary test is true if the result of f(x; θ) is greater than (or alternatively, less than) the threshold value. In the examples described herein, the parameter θ defines the offset of a point in time in the sequence from the current data item, and optionally the type of feature read from the sequence at that offset. An example function f(x; θ) is described below with reference to FIG. 6.
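A possible reading of this test in code, assuming each data item is a mapping from feature name to value and that θ is an (offset, feature) pair as described above:

```python
def binary_test(sequence, theta, xi, tau):
    """Binary test of the form xi > f(x; theta) > tau.

    `sequence` is the list of data items ending at the current item; `theta`
    is an (offset, feature) pair: how far back in time to look and which
    feature to read there. Returns True if the feature value read at that
    offset lies strictly between the two thresholds.
    """
    offset, feature = theta
    value = sequence[-1 - offset][feature]   # f(x; theta): feature value at the offset
    return xi > value > tau
```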
The result of the binary test performed at the root node or split node determines to which child node the data item is passed. For example, if the result of the binary test is true, the data item is passed to the first child node, and if the result is false, the data item is passed to the second child node.
The generated random set of test parameters includes a plurality of random values of the function parameter θ and the threshold values ξ and τ. To inject randomness into the decision trees and reduce computation, the function parameters θ of each split node are optimized only with respect to a randomly sampled subset Θ of all possible parameters. For example, the size of the subset Θ may be five hundred. This is an efficient and simple way to inject randomness into the trees and increase generalization, while avoiding a computationally intensive search over all possible tests.
Each combination of test parameters is then applied 412 to each data item in the training set. In other words, for each annotated data item in each training sequence, all available values of θ (i.e., θi ∈ Θ) are tried one after the other, in combination with all available values of ξ and τ. For each combination, the information gain (also referred to as relative entropy) is calculated. The combination of parameters that maximizes the information gain (denoted θ*, ξ* and τ*) is selected 414 and stored in association with the current node for later use. As an alternative to information gain, other criteria may be used, such as Gini entropy or the "two-ing" criterion.
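One way this candidate search could look in code, assuming a Shannon-entropy information gain and the binary-test shape sketched above; each training example is taken to be a data item together with its temporal history (a window of items), and all names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, left, right):
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

def best_candidate(windows, labels, candidates, binary_test):
    """Try each randomly generated (theta, xi, tau) candidate and keep the one
    with the highest information gain over the node's training items."""
    best, best_gain = None, -1.0
    for theta, xi, tau in candidates:
        passed = [binary_test(w, theta, xi, tau) for w in windows]
        left = [y for y, p in zip(labels, passed) if p]
        right = [y for y, p in zip(labels, passed) if not p]
        if not left or not right:
            continue                       # degenerate split: skip this candidate
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best, best_gain = (theta, xi, tau), gain
    return best, best_gain
```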
It is then determined 416 whether the value of the maximized information gain is less than a threshold. If the value of the information gain is less than the threshold, this indicates that further expansion of the tree does not provide significant benefit. This gives rise to asymmetric trees that naturally stop growing when no further nodes are beneficial. In this case, the current node is set 418 as a leaf node. Similarly, the current depth of the tree (i.e., how many levels of nodes are between the root node and the current node) is determined 416. If this value is greater than a predefined maximum value, the current node is set 418 as a leaf node. In one example, the maximum tree depth may be set to 15 levels, although other values may be used.
If the value for maximized information gain is greater than or equal to the threshold and the depth of the tree is less than the maximum value, then the current node is set 420 as the split node. When the current node is a split node, it has child nodes, and the process then moves to training these child nodes. Each child node is trained at the current node using a subset of the data items. Using a parameter θ that maximizes information gain*、ξ*And τ*To determine the subset of data items that are sent to the child node. These parameters are used in a binary test and the binary test is performed 422 on all data items at the current node. The data items that pass the binary test form a first subset that is sent to the first child node, while the data items that do not pass the binary test form a second subset that is sent to the second child node.
For each of the child nodes, the process outlined in blocks 410 to 422 of FIG. 4 is recursively performed 424 for the subset of training data items directed to the respective child node. In other words, for each child node, new random test parameters are generated 410 and applied 412 to the respective subset of data items, the parameters that maximize the information gain are selected 414, and the type of node (split node or leaf node) is determined 416. If it is a leaf node, the current recursive branch stops. If it is a split node, a binary test is performed 422 to determine further subsets of data items and another recursive branch begins. Thus, this process traverses the tree in a recursive manner, training each node until a leaf node is reached at each branch. When a leaf node is reached, the process waits 426 until the nodes in all branches have been trained. Note that in other examples, alternative recursive techniques may be used to achieve the same functionality. For example, one alternative technique is to train "breadth first", in which an entire level of the tree is trained at a time, i.e., the size of the tree doubles in each iteration.
Once all the nodes in the tree have been trained to determine the parameters of the binary test that maximizes the information gain at each split node, and leaf nodes have been selected to terminate each branch, the probability distributions for all the leaf nodes of the tree can be determined. This is achieved by counting 428 the class labels of the training data items that reach each of the leaf nodes. All training data items end up at a leaf node of the tree. As each training data item has a class label associated with it, the total number of training data items in each class may be counted at each leaf node. From the number of training data items in each class at a leaf node and the total number of training data items at that leaf node, a probability distribution for the classes at that leaf node may be generated 430. To generate the distribution, the histogram is normalized. Optionally, a small prior count may be added to all classes so that no class is assigned zero probability, which improves generalization.
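A brief sketch of this leaf-histogram step; the class set and the size of the prior count are illustrative assumptions:

```python
from collections import Counter

def leaf_distribution(leaf_labels, classes, prior_count=1.0):
    """Normalized class histogram at a leaf, with a small prior count added
    to every class so that no class is assigned zero probability."""
    counts = Counter(leaf_labels)
    raw = {c: counts.get(c, 0) + prior_count for c in classes}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}

# Example: a leaf reached by 8 punch action points and 2 background items.
dist = leaf_distribution(["punch"] * 8 + ["background"] * 2,
                         classes=["punch", "kick", "background"])
```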
FIG. 5 shows an example probability distribution 512 for a leaf node 510. The probability distribution records, for each class c, the probability that a data item reaching that leaf node belongs to class c, expressed as P(c | l_t), where l_t denotes the leaf node l of the t-th tree. In other words, the leaf nodes store the posterior probabilities over the trained classes. Such a probability distribution can therefore be used to determine the likelihood that a data item arriving at that leaf node is the action point of a given gesture class, as described in more detail below.
Returning to FIG. 4, once the probability distributions for the leaf nodes of the tree are determined, a determination 432 is made as to whether there are more trees in the decision forest. If so, the next tree in the decision forest is selected and the process repeats. If all of the trees in the forest have been trained, and no other trees remain, the training process is complete and the process terminates 434.
Thus, as a result of this training process, a plurality of decision trees are trained using the training sequence. Each tree includes a plurality of split nodes that store optimized test parameters and leaf nodes that store associated probability distributions. Since the parameters are randomly generated from a finite subset used at each node, the trees in a forest of trees are distinguished from each other (i.e., different).
An example test of the form ξ > f(x; θ) > τ applied to a sequence, with three random test parameter sets, is now described with reference to FIG. 6. FIG. 6 shows an example sequence for a gesture that completes at an action point. The sequence 600 of FIG. 6 is shown as feature values 602 (such as one or more of joint angle, velocity, inertia, depth/RGB image features, audio signals, etc.) plotted against time 604, where an action point 606 occurs at the current time 608. For example, if the sequence 600 relates to a punch gesture, the data item at the action point 606 has the classification "punch" and all other data items have the classification "background".
Note that, for clarity, the example of FIG. 6 shows a sequence comprising values of only a single feature, while in other examples the sequence may have data items describing several different features in parallel, each with an associated feature value. In examples where there are several features, the parameter θ at a node may also identify which feature to analyze for a given test, but this is not shown in FIG. 6. Note also that the sequence comprises (in most examples) discrete samples of an underlying continuous signal. Furthermore, the samples may not be received at uniform time intervals. The sequence may therefore be resampled using interpolation (e.g., nearest neighbor, linear, quadratic, etc.) to compensate for the differing time intervals.
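A minimal resampling sketch along these lines, assuming linear interpolation onto a uniform time grid (numpy assumed; names illustrative):

```python
import numpy as np

def resample_uniform(timestamps, values, period):
    """Linearly interpolate a non-uniformly sampled feature onto a uniform
    grid with the given sample period, ending at the most recent timestamp."""
    grid = np.arange(timestamps[0], timestamps[-1] + 1e-9, period)
    return grid, np.interp(grid, timestamps, values)
```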
In the first example test, a sequence index value θ1 610 has been randomly generated, the index value corresponding to a point in time along the sequence. This may be expressed as an offset from the current time 608 (the action point 606 in this example). To calculate f(x; θ1) for the sequence, the feature value of the data item at the time instance indicated by the index value θ1 610 is found.
FIG. 6 also shows a pair of randomly generated thresholds ξ1 612 and τ1 614. The thresholds ξ1 612 and τ1 614 denote feature-value thresholds associated with the index value θ1 610. Therefore, when the feature value of the data item at the index value θ1 610 is less than the threshold ξ1 612 and greater than the threshold τ1 614, the test ξ1 > f(x; θ1) > τ1 is passed (as is the case in the example of FIG. 6).
In a second example test, a second sequence index value θ2 616 and thresholds ξ2 618 and τ2 620 have been randomly generated. As above, when the feature value of the data item at the index value θ2 616 lies between the test thresholds ξ2 618 and τ2 620, the sequence 600 passes the example test ξ2 > f(x; θ2) > τ2. Similarly, in a third example, a third sequence index value θ3 622 and thresholds ξ3 624 and τ3 626 have been randomly generated. Again, when the feature value of the data item at the index value θ3 622 lies between the test thresholds ξ3 624 and τ3 626, the sequence 600 passes the example test ξ3 > f(x; θ3) > τ3.
If these three randomly generated binary tests are used in a decision tree, then a sequence that falls within the three ranges defined by the parameters satisfies all three binary tests and (in this example) may have a high probability of representing the same gesture as that occurring at the action point 606. Clearly, the present example shows only some of the large number of possible combinations of index values and thresholds and is merely illustrative. It shows, however, how the similarity between sequences can be captured by considering whether representative or distinctive points in the sequences are within a threshold.
If, during the training process described above, the algorithm were to select the three sets of random parameters shown in FIG. 6 for use at three nodes of a decision tree, these parameters could be used to test a sequence as shown in FIG. 7. FIG. 7 shows a decision tree with three levels, using the example tests of FIG. 6. The training algorithm has selected the first set of parameters from FIG. 6 (θ1, ξ1 and τ1) as the test applied at the root node 700 of the decision tree of FIG. 7. As described above, the training algorithm chooses this test because it has the greatest information gain over the training data. The current data item x of the sequence (i.e., the most recently received data item) is applied to the root node 700 and the test is performed on this data item. If the sequence 600 of FIG. 6 is used as an example, it can be seen that the data item at θ1 lies between the thresholds ξ1 and τ1, and thus the result of the test is true. If the test were performed on another sequence whose data item at θ1 lies outside the thresholds ξ1 and τ1, the result would be false.
Thus, when sequences of gesture-related data items are applied to the trained decision tree of FIG. 7, sequences having a value at θ1 between the thresholds ξ1 and τ1 (i.e., that pass the binary test) are passed to the child split node 702, while sequences that do not pass the binary test are passed to the other child node.
The training algorithm has selected the second set of test parameters from FIG. 6 (θ2, ξ2 and τ2) as the test applied at the split node 702. As shown in FIG. 6, the sequences that pass this test are those whose value at the index θ2 lies between the thresholds ξ2 and τ2. Since only sequences that pass the binary test associated with the root node 700 reach the split node 702, the sequences that pass this test are those whose value at θ1 lies between ξ1 and τ1 and whose value at θ2 lies between ξ2 and τ2. The sequences that pass the test are provided to the split node 704.
The training algorithm has selected the third set of test parameters from FIG. 6 (θ3, ξ3 and τ3) as the test applied at the split node 704. FIG. 6 shows that only sequences having a feature value at θ3 between ξ3 and τ3 pass this test. Only sequences that pass the tests at the root node 700 and the split node 702 reach the split node 704, so the sequences that pass the test at the split node 704 are those that fall between each of the thresholds shown in FIG. 6 (such as the example sequence 600 of FIG. 6). Sequences that pass the test at the split node 704 are provided to the leaf node 706.
The leaf node 706 stores a probability distribution 708 over the different gesture classes. In this example, the probability distribution indicates a high probability 710 that a data item reaching this leaf node 706 corresponds to the action point of a punch gesture. It is to be understood that the learning algorithm may order the tests arbitrarily, and the features evaluated need not be in chronological order.
In the examples of FIGS. 6 and 7 described above, each of these tests can be performed when the sequence being tested contains enough data items to perform the test in question. However, in some cases, the tree may be trained such that a node uses a test that cannot be applied to a given sequence. For example, if the sequence being tested has only a small number of data items, a test whose index value θ reaches further back in time than the number of available data items cannot be performed. In this case, the test is not performed and the current data item is sent to both child nodes, so that further tests lower in the tree can still be used to obtain a result. The result is then obtained by using all of the leaf nodes reached. In an alternative example, to avoid testing short sequences, a maximum feature time window (e.g., a 1-second time window) may be defined, and classification is not performed until sufficient readings have been obtained (e.g., the first second of the sequence is ignored).
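A sketch of how traversal might handle such short sequences, following both children when a test's offset exceeds the available history; the Node fields are illustrative, not the patent's data structure:

```python
from dataclasses import dataclass

@dataclass
class Node:
    is_leaf: bool
    distribution: dict = None   # class -> probability, for leaf nodes
    theta: tuple = None         # (offset, feature), for split nodes
    xi: float = None
    tau: float = None
    left: "Node" = None
    right: "Node" = None

def descend(node, sequence, results):
    """Collect the leaf distributions reached by a sequence. If a split node's
    test looks back further than the sequence allows, the test is skipped and
    both children are followed."""
    if node.is_leaf:
        results.append(node.distribution)
        return
    offset, feature = node.theta
    if offset >= len(sequence):              # short sequence: test cannot be applied
        descend(node.left, sequence, results)
        descend(node.right, sequence, results)
    elif node.xi > sequence[-1 - offset][feature] > node.tau:
        descend(node.left, sequence, results)
    else:
        descend(node.right, sequence, results)
```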
It is clear that fig. 6 and 7 provide simplified examples, and that in practice a trained decision tree may have many more levels (and thus many more sample points along the sequence). Furthermore, in practice, many decision trees are used in a forest, and the results are combined to increase accuracy, as outlined below with reference to fig. 8.
FIG. 8 shows a flow diagram of a process for identifying features in a previously unseen sequence using a decision forest that has been trained as described above. First, a new data item is received 800 at the gesture recognition algorithm. The goal of the algorithm is to classify the new data item as the action point of a specified gesture or as background. The sequence is generated 802 by forming a time series from the new data item and a plurality of previously received data items already stored at the storage device. The length of the sequence to be generated may be predefined. In one example, the algorithm may be arranged to generate a sequence of 30 data items, although any suitable value may be used. In some examples, the sequence may be shorter when not enough previous data items have been received. This sequence may be referred to as "unseen" to distinguish it from the training sequences, whose data items have already been classified.
Note that, as mentioned above, some examples may utilize sequences based on multiple features formed concurrently. For example, the sequence may include data items describing both the angle between the joints of the user and the velocity of the body part of the user. In such an example, the test parameter θ at each node of the tree specifies which feature is to be tested against the threshold.
A trained decision tree from the decision forest is selected 804. The new data item and its associated sequence are pushed 806 through the selected decision tree (in a manner similar to that described above with reference to FIGS. 6 and 7), such that they are tested against the trained parameters at a node and then passed to the appropriate child depending on the result of the test, and the process is repeated until the new data item reaches a leaf node. Once the new data item reaches a leaf node, the probability distribution associated with this leaf node is stored 808 for the new data item.
If it is determined 810 that there are more decision trees in the forest, a new decision tree is selected 804, the new data item is pushed 806 through the tree and the probability distribution is stored 808. This process is repeated until the process is performed for all decision trees in the forest. Note that the process for pushing new data items through multiple trees in a forest of decision trees may also be performed in parallel, rather than in sequence as shown in FIG. 8.
Once the new data item and its sequence have been pushed through all of the trees in the decision forest, a plurality of gesture classification probability distributions (at least one per tree) are stored for the new data item. These probability distributions are then aggregated 812 to form an overall probability distribution for the new data item. In one example, the overall probability distribution is the average of all the individual probability distributions from the T different decision trees. This is given by: P(c | x) = (1/T) Σ_t P(c | l_t), where the sum runs over the T trees and l_t is the leaf node reached in the t-th tree.
Note that alternative methods of combining the tree posterior probabilities, other than averaging (such as multiplying the probabilities), may also be used. Optionally, an analysis of the variability between the individual probability distributions may be performed (not shown in FIG. 8). Such an analysis can provide information about the uncertainty of the overall probability distribution. In one example, the standard deviation may be determined as a measure of variability.
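A compact sketch of the averaging step, with the per-class standard deviation as the variability measure mentioned above (numpy assumed; the example values are invented for illustration):

```python
import numpy as np

def aggregate_forest(per_tree_distributions):
    """Average the per-tree class distributions for one data item and report
    the per-class standard deviation as a simple variability measure."""
    classes = sorted(per_tree_distributions[0])
    stacked = np.array([[d[c] for c in classes] for d in per_tree_distributions])
    mean = dict(zip(classes, stacked.mean(axis=0)))
    spread = dict(zip(classes, stacked.std(axis=0)))
    return mean, spread

# Example with two trees voting on the same data item.
mean, spread = aggregate_forest([
    {"punch": 0.7, "kick": 0.1, "background": 0.2},
    {"punch": 0.5, "kick": 0.2, "background": 0.3},
])
```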
Once the overall probability distribution for the new data item is found, the probability for each classification is compared 814 to a threshold value associated with each classification. In one example, a different threshold may be set for each gesture classification.
If it is determined 816 that the probability for each classification is not greater than its associated threshold value, or indicates that the new data item has a "background" classification, the process waits for the next new data item to be received and repeats. However, if it is determined 816 that the probability of a classification is greater than its associated threshold value and indicates that the new data item is not "background," this indicates that the new data item corresponds to an action point of a gesture. A gesture corresponding to category c is detected 818 and the function, command or action associated with this gesture is performed 820, as described above.
Thus, the gesture recognition algorithm described above allows classification of a newly received, unseen data item relating to a feature as an action point of a gesture by using information of a previously received data item in the sequence. Random decision forests provide a useful technique for training classifiers and applying tests to previously received data items. Although the process for training the decision tree is relatively complex, the evaluations performed to classify new data items are relatively lightweight in terms of processing and can be performed quickly to minimize recognition lag.
Recognition lag can be further reduced, if desired, by training the decision tree and then hard-coding the optimized tests into a set of instructions. For example, the tests performed by the trained decision tree may be written out as a C program and then compiled. The compilation optimizes the program for the processor used and makes the decision tree classification fast to execute.
Reference is now made to FIGS. 9 and 10, which illustrate an alternative technique for recognizing gestures in sequences of data items such as those described above. This technique is based on a trained logistic model. The operation of the technique is described first, followed by a description of the training of the model.
In one example, the model used is a log-linear logistic model of the following form:
p_g(t) = 1 / (1 + exp(-w_g · φ(x_(t-W+1):t)))
where x_(t-W+1):t is the sequence of data items of length W extending backwards from the current time t (similar to that described above with reference to the decision forest example), φ(x_(t-W+1):t) is a feature function (described in more detail below) used to test the sequence, and p_g(t) is the probability that the data item at time t is the action point of gesture g. By learning appropriate weights w_g, these features may be weighted and combined to produce an overall gesture probability.
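Assuming the standard logistic link implied by the term "log-linear logistic model", a minimal sketch of evaluating p_g(t) from the feature-function outputs (the weights and feature values below are invented for illustration):

```python
import numpy as np

def gesture_probability(w_g, phi):
    """p_g(t) = 1 / (1 + exp(-w_g . phi)): logistic combination of the weighted
    feature-function outputs for gesture g on the current window."""
    return 1.0 / (1.0 + np.exp(-np.dot(w_g, phi)))

# Example: three template features, two of which fired for the current window.
phi = np.array([1.0, 0.0, 1.0])
w_g = np.array([1.2, 0.4, 0.9])    # learned weights for gesture g
p = gesture_probability(w_g, phi)
```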
The operation of the present technique when using this model is described with reference to fig. 9. First, a new data item is received 900 at a gesture recognition algorithm. The sequence is generated 902 by forming a time series from the new data item and a plurality of previously received data items already stored at the storage device. The length of the sequence is denoted W, as described above.
For example, referring to FIG. 10, an illustrative sequence 1000 is shown that includes a new data item received at the current time 1002 and a plurality of previously received data items, each having a feature value. Note that the example of FIG. 10 shows only a single type of feature in the sequence; in other examples the sequence may include a plurality of different features. The test feature function φ(x_(t-W+1):t) is then applied to the sequence. In this example, the testing includes comparing 904 the generated sequence to a set of predefined stored templates 906, which have previously been selected in the training process to provide indicative examples of gestures (also shown in FIG. 10). A measure of similarity between the sequence and each of the stored templates is determined. In one example, the measure of similarity used is the Euclidean distance between the template and the sequence. The result of the test is then found by determining 908 whether the similarity is greater than a predefined threshold.
More formally, each of these templates has a set of parameters associated with it. For example, each template may include parameters ω, A, M, r, υ and w_f, where ω defines the length of the template sequence, A defines the features to be tested, M is the template pattern itself, r is a threshold value used to determine 908 whether the template is sufficiently similar to the sequence, υ is the distance into the future (described in more detail below) at which the weight or vote is to be cast, and w_f is the weight associated with the template. The result of the test feature function φ_f(x_(t-W+1):t) for template f is then given by:
φ_f(x_(t-W+1):t) = 1 if D(x_(t-ω+1):t |_A, M) ≤ r, and 0 otherwise,
where |_A denotes the projection onto the set of features A, and D(·,·) is a distance function (such as the Euclidean distance) between the sequence and the template.
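A sketch of this template test under the assumptions above (projection onto the feature set A, Euclidean distance, binary output); the dictionary layout of a template is illustrative:

```python
import numpy as np

def template_test(sequence, template):
    """Binary feature function for one template: project the most recent omega
    items of the sequence onto the template's feature set A and fire (return 1)
    if the Euclidean distance to the pattern M is within the threshold r."""
    omega, A, M, r = template["omega"], template["A"], template["M"], template["r"]
    if len(sequence) < omega:
        return 0                                # not enough history to compare
    window = np.array([[item[f] for f in A] for item in sequence[-omega:]])
    return int(np.linalg.norm(window - M) <= r)
```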
If it is determined 908 that the similarity between the sequence and the template is not greater than the threshold (i.e., φ_f(x_(t-W+1):t) = 0), the process waits for the next data item to be received and repeats. However, if it is determined 908 that the similarity between the sequence and the template is greater than the threshold (i.e., φ_f(x_(t-W+1):t) = 1), the future time associated with the matching template is calculated 910. This is calculated by adding the above-mentioned parameter υ to the current time. For example, referring to FIG. 10, the future time 1004 is calculated from the parameters of the matching template.
The gesture recognition algorithm maintains a gesture likelihood list 1006, which is a time-series list storing values corresponding to the calculated likelihood that a given gesture action point occurs at a given time. The weight w_f of the template is added 912 to the gesture likelihood list at the future time 1004. In other words, as shown in FIG. 10, the weight w_f is projected 1008 into the future and aggregated with any existing likelihood value at the future time 1004 to give an updated likelihood value 1010 for the gesture associated with the template.
The gesture likelihood list 1006 is then read 914 to determine a gesture likelihood value (such as likelihood value 1012) for the current time t. If it is determined 916 that one of these gesture likelihood values is greater than the predefined value, the algorithm considers the current time t to be the action point for this gesture, and the gesture is detected and may be acted upon, as outlined above. On the other hand, if it is determined 916 that none of the gesture likelihood values is greater than the predefined value, the algorithm detects no gesture and waits for a new data item and repeats.
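A minimal sketch of the likelihood-list bookkeeping described in this and the preceding steps, assuming discrete time indices; the class and method names are illustrative:

```python
from collections import defaultdict

class GestureLikelihoodList:
    """Time-indexed accumulator of projected template votes for one gesture."""

    def __init__(self, detection_value):
        self.detection_value = detection_value
        self.votes = defaultdict(float)      # time index -> accumulated likelihood

    def project(self, current_time, upsilon, weight):
        """Cast a matching template's weight at the future time current_time + upsilon."""
        self.votes[current_time + upsilon] += weight

    def detect(self, current_time):
        """True if the accumulated likelihood at the current time exceeds the
        predefined detection value, i.e. this time is treated as an action point."""
        return self.votes[current_time] >= self.detection_value
```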
Thus, by projecting weights to future times, the present algorithm builds up the likelihood of a gesture while it is being performed, such that when the action point occurs, the main part of the processing for gesture detection has already been completed and the gesture can be detected with low latency. For example, looking at FIG. 10, if the predefined value 1014 for detecting a gesture is set as shown, the weight w_f projected 1008 to the future time 1004 is sufficient to make the updated likelihood value 1010 greater than the predefined value 1014. Thus, when the future time 1004 is reached, it has already been determined that sufficient evidence has been seen to indicate that the future time 1004 is the action point of the gesture, and the gesture can be quickly detected.
The scheme of projecting weights into the future enables the gesture recognition system to provide intermediate results to an application. For example, in the case of a computer game, this allows the game designer to know in advance whether a possible gesture is coming and thus provide some feedback to the user. For example, if a user begins to throw a punch, the computer game graphics may display a virtual punch in the game to encourage the user to complete the punch. This may be arranged to fade in as the gesture becomes more likely (i.e., as more weight is projected for the gesture). This also enables users to discover gestures more easily, as they are shown faintly on screen during background movement.
To allow for the above gesture recognition algorithm, the parameters of the templates are learned using machine learning techniques based on annotated training data similar to that described above with reference to the decision forest example. In one example, the learning of the weights may be posed as a logistic regression problem. As such, in one example, the parameters of the templates may be learned by sampling a random set. In an alternative example, to increase the chance of selecting a good set, a boosting-like approach may be used to generate good features on demand. Furthermore, to ensure that only a few features receive a non-zero weight w_f, a sparsity-inducing norm on w can be used. This leads to the following learning problem:
where T_i is the length of the i-th training sequence, and C_{i,t} ≥ 0 determines the relative importance of correctly predicting the t-th item in sequence i. After solving the learning problem described above, a per-data-item weight can be derived as:
this allows the definition of a so-called "pricing problem" to find features to be added that guarantee an improved classifier, as measured on the target values in equation (1):
this is an optimization problem for the hypothesis space F of all possible values of the template parameters. Solving this pricing problem enables the feature to be found that when added to F' (a subset of F) minimizes the goal of equation (1). For example, if f*Is the minimizer of equation (2), then the target is reduced if:
in one example, equation (2) can be optimized by randomly selecting the parameters ω, a, and υ. The template mode M is taken from before the gesture action point in the training sequence and r is explicitly optimized for it. This provides an approximation to the solution of equation 2. The selected number of the frontmost features is maintained (such as the first few hundred) and used to solve equation (1).
Both of the above techniques (based on decision trees and on logistic regression models) thus provide gesture recognition that allows low-latency detection and recognition of user gestures in a computationally efficient manner. These gesture recognition techniques have been illustrated in the exemplary context of static devices (such as gaming devices) that can be controlled with user gestures captured by a camera, and of mobile handheld devices (such as mobile phones) that can be controlled with user gestures detected by motion and/or orientation sensors within the device.
In a further example, these two contexts may be combined. For example, a handheld mobile device may communicate with a gaming device such that features from the handheld device (e.g., inertial sensors) can be combined with features from a camera (e.g., body part features). This may be used to allow fine motion details to be captured from the handheld device sensors and incorporated into gesture recognition (which may be performed in the static device). As an illustrative example, this may be used in a bowling game in which the user holds the mobile device in his hand while bowling: the user's broader, coarse movements are captured using the camera, while fine movements representing, for example, spin on the ball are captured using the handheld device's sensors. These separate signals are used together for gesture recognition to control the operation of the bowling game.
This may be performed by the mobile device reporting a continuous stream of sensor data to the static device, which uses both types of features to perform gesture recognition. In an alternative example, the recognition algorithm may be run separately on both the mobile device and the static device, and the mobile device may be arranged to send discrete recognized gesture signals back to the static device once a gesture is recognized, thereby reducing bandwidth usage.
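A minimal sketch of the continuous-streaming variant is given below, assuming the two devices exchange timestamped samples; the object methods (next_frame, sample_at, update) and the fusion-by-concatenation step are illustrative assumptions rather than interfaces defined in the description above.

```python
import time


def fuse_features(camera_features, inertial_features):
    """Concatenate coarse camera-derived features with fine inertial features
    for one time step, giving a single data item for the recogniser."""
    return list(camera_features) + list(inertial_features)


def recognition_loop(camera_source, handheld_stream, recogniser):
    """Continuous-streaming variant: the static device receives the handheld
    sensor stream and runs a single recogniser over the fused data items."""
    while True:
        timestamp, camera_features = camera_source.next_frame()
        # Take the inertial sample closest in time to the camera frame.
        inertial_features = handheld_stream.sample_at(timestamp)
        data_item = fuse_features(camera_features, inertial_features)
        gesture = recogniser.update(data_item)
        if gesture is not None:
            print(f"{time.strftime('%H:%M:%S')} detected gesture: {gesture}")
```

In the alternative arrangement, recognition would run on both devices and only discrete gesture events would be sent from the mobile device, so no per-sample stream (and hence far less bandwidth) is required.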
In the above example, the gesture has an action point only at the completion of the gesture. However, in other examples, the gesture recognition techniques may be used to detect gestures that comprise several action points. For example, if the gesture comprises a circular rotation of the handheld device, this may be divided into four separate sub-gestures, each corresponding to movement through one of four compass points. Thus, a "ring" gesture can be considered to have four action points. Each of these sub-gestures is detected in sequence to trigger the overall "ring" gesture. Each of these sub-gestures may be trained as a class in the classifier described above and recognized accordingly. The classification outputs from the classifier may be provided to a state machine, which triggers the overall gesture when its conditions are met by combining the detected sub-gestures in a defined manner.
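A minimal sketch of such a state machine follows; the sub-gesture labels, the required order, and the reset-on-unexpected-input policy are assumptions chosen for illustration, not conditions taken from the description above.

```python
class RingGestureStateMachine:
    """Combine four sub-gesture classifications (one per compass point) into a
    single 'ring' gesture; fires when all four are seen in the defined order."""

    ORDER = ["north", "east", "south", "west"]

    def __init__(self):
        self.position = 0  # index of the next expected sub-gesture

    def on_sub_gesture(self, label):
        if label == self.ORDER[self.position]:
            self.position += 1
            if self.position == len(self.ORDER):
                self.position = 0
                return True  # all four action points seen in order
        elif label in self.ORDER:
            # Unexpected sub-gesture: restart, treating 'north' as a new start.
            self.position = 1 if label == self.ORDER[0] else 0
        return False


# Example: the full ring gesture triggers only after the fourth sub-gesture.
sm = RingGestureStateMachine()
print([sm.on_sub_gesture(s) for s in ["north", "east", "south", "west"]])
# -> [False, False, False, True]
```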
FIG. 11 illustrates various components of an exemplary computing device 104, which may be implemented as any form of a computing device and/or electronic device in which embodiments of the above-described gesture recognition techniques may be implemented.
Computing device 104 includes one or more processors 1102, which may be microprocessors, controllers, or any other suitable type of processor for processing computer-executable instructions to control the operation of the device in order to perform gesture recognition. In some examples, such as in examples using a system-on-a-chip architecture, the processor 1102 may include one or more fixed function blocks (also known as accelerators) that implement portions of the gesture recognition method in hardware (rather than software or firmware).
The computing-based device 104 also includes an input interface 1104 arranged to receive input from one or more devices, such as the capture device 106 of FIG. 2 or the sensor of FIG. 3. An output interface 1106 is also provided and arranged to provide output to, for example, a display system (such as display device 108 or 310) integrated with or in communication with the computing-based device. The display system may provide a graphical user interface, or any other suitable type of user interface, but this is not required. A communication interface 1108 may optionally be provided, which may be arranged to communicate with one or more communication networks (e.g., the internet).
Computer-executable instructions may be provided using any computer-readable media accessible by computing-based device 104. Computer-readable media may include, for example, computer storage media such as memory 1110 and communication media. Computer storage media, such as memory 1110, includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. As defined herein, computer storage media does not include communication media. While computer storage media (memory 1110) is shown in computing-based device 104, it should be appreciated that such storage can be distributed or remotely located and accessed via a network or other communication link (e.g., using communication interface 1108).
Platform software, including an operating system 1112 or any other suitable platform software, may be provided at the computing-based device to enable application software 218 to be executed on the device. Memory 1110 may store executable instructions to implement the functionality of body pose estimator 214 (described above with reference to FIG. 2) and gesture recognition engine 216 (e.g., using a trained decision forest or regression model as described above). Memory 1110 may also provide a data store 1114, which may be used to store data used by processor 1102 when performing the gesture recognition techniques, such as previously received data items, parameters of trained trees, and/or templates.
The term 'computer' as used herein refers to any device with processing capabilities such that it can execute instructions. Those skilled in the art will recognize that such processing capabilities are integrated into many different devices, and thus, the term 'computer' includes PCs, servers, mobile phones, personal digital assistants and many other devices.
The methods described herein may be performed by software in machine-readable form on a tangible storage medium, for example in the form of a computer program comprising computer program code means adapted to perform all the steps of any of the methods described herein when the program is run on a computer, and where the computer program may be embodied on a computer-readable medium. Examples of tangible (or non-transitory) storage media include disks, thumb drives, memory, and the like, and do not include propagated signals. The software may be adapted to be executed on a parallel processor or a serial processor such that the method steps may be performed in any suitable order, or simultaneously.
This acknowledges that software can be a valuable, separately tradable commodity. It is intended to encompass software running on, or controlling, "dumb" or standard hardware to carry out the desired functions. It is also intended to encompass software that describes or defines the configuration of hardware to achieve a desired function, such as HDL (hardware description language) software used to design silicon chips or to configure general purpose programmable chips.
Those skilled in the art will realize that storage devices utilized to store program instructions may be distributed across a network. For example, a remote computer may store an example of the process described as software. A local or terminal computer may access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software as needed, or execute some software instructions at the local terminal and others at the remote computer (or computer network). Those skilled in the art will also realize that, by utilizing conventional techniques known to those skilled in the art, all or a portion of the software instructions may be carried out by a dedicated circuit, such as a DSP, a programmable logic array, or the like.
As will be clear to those skilled in the art, any of the ranges or device values given herein may be extended or altered without losing the effect sought.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
It is to be understood that the benefits and advantages described above may relate to one embodiment or may relate to multiple embodiments. The embodiments are not limited to embodiments that solve any or all of the problems or embodiments having any or all of the benefits and advantages described. It will further be understood that references to 'an' item refer to one or more of those items.
The steps of the methods described herein may be performed in any suitable order, or simultaneously, where appropriate. In addition, individual blocks may be deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.
The term "comprises/comprising" when used herein is intended to cover the identified blocks or elements of the method, but does not constitute an exclusive list and a method or apparatus may contain additional blocks or elements.
It will be understood that the above description of a preferred embodiment is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments of the invention. Although various embodiments of the present invention have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the spirit or scope of this invention.
Claims (10)
1. A computer-implemented gesture detection method, comprising:
receiving, at a processor (1102), a sequence of data items (600) relating to a motion of a user (102) performing a gesture;
testing a plurality of selected data items from the sequence (600) against a predefined threshold value to determine a probability that the sequence (600) represents a gesture; and
detecting the gesture if the probability is greater than a predetermined value.
2. The method of claim 1, further comprising the step of executing a command in response to detecting the gesture.
3. The method according to claim 1 or 2, wherein the data items in the sequence (600) represent at least one of:
i) an angle between at least two body parts of the user (102);
ii) a rate of change of an angle between at least two body parts of the user (102);
iii) a speed of movement of at least one body part of the user (102);
iv) features derived from the depth image of the user;
v) features derived from the user's RGB image;
vi) a speech signal from the user;
vii) inertia of the mobile device (300);
viii) the speed of the mobile device (300);
ix) orientation of the mobile device (300); and
x) a location of the mobile device (300).
4. A method according to claim 1, 2 or 3, wherein the step of testing comprises applying the sequence to a trained decision tree (500).
5. The method of claim 4, wherein the step of applying the sequence to a trained decision tree (500) comprises passing the sequence through a plurality of nodes in the tree until a leaf node (510) in the tree is reached, and wherein the probability that the sequence represents a gesture is determined in dependence on the leaf node (510) reached in the decision tree.
6. The method of claim 5, wherein each node of the decision tree (500) is associated with at least one of an index value and the predefined threshold value, and the step of applying the sequence (600) to the trained decision tree (500) comprises, at each node, comparing data items in the sequence (600) located at the index value with the at least one of the predefined threshold values to determine a subsequent node to which to send the sequence.
7. The method of claim 5 or 6, wherein the step of testing further comprises applying the sequence (600) to at least one further, discriminative, trained decision tree (502), and determining the probability that the sequence represents the gesture by averaging the probabilities from each of the trained decision trees.
8. The method according to any of the claims 4 to 7, further comprising the step of training the decision tree (500) prior to receiving the sequence (600), wherein the step of training the decision tree comprises:
selecting a node of the decision tree;
selecting at least one annotated training sequence;
generating a plurality of random index values and random threshold values;
comparing data items from the annotated sequence located at each of the random index values to each of the random threshold values to obtain a plurality of results;
selecting a selected index value and at least one selected threshold value for the node according to the plurality of results; and
storing the selected index value and the at least one selected threshold value associated with the node at a storage device.
9. The method of claim 8, wherein the step of selecting the selected index value and the at least one selected threshold value comprises determining an information gain for each of the plurality of results, and selecting the selected index value and the at least one selected threshold value associated with the result having the largest information gain.
10. A gesture recognition system, comprising:
a user interface (1104) arranged to receive a sequence of depth images of a user (102);
a memory (1110) arranged to store a random decision forest comprising a plurality of differentiated trained decision trees; and
a processor (1102), the processor being arranged to: generating a sequence (600) of data items relating to the motion of the user (102) from the depth image; applying the sequence of data items (600) to each of the trained decision trees to obtain a plurality of probabilities that the sequence represents one of a plurality of predefined gestures; aggregating probabilities from each of the trained decision trees; and if the aggregated probability of the gestures is greater than a predetermined value, executing a command associated with the detected gesture.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/040,487 | 2011-03-04 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1174707A (en) | 2013-06-14 |
| HK1174707B (en) | 2017-10-27 |
Similar Documents
| Publication | Title |
|---|---|
| US9619035B2 (en) | Gesture detection and recognition |
| US9886094B2 (en) | Low-latency gesture detection |
| EP3191989B1 (en) | Video processing for motor task analysis |
| CN111766948B (en) | Pose Prediction Using Recurrent Neural Networks |
| JP6333844B2 (en) | Resource allocation for machine learning |
| US8571263B2 (en) | Predicting joint positions |
| CN108713177B (en) | Operation analysis support device, operation analysis support method, computer program, and information storage medium |
| CN102165396B (en) | Enhanced detection of waving engagement gestures |
| Ellis et al. | Exploring the trade-off between accuracy and observational latency in action recognition |
| US9069381B2 (en) | Interacting with a computer based application |
| US10368784B2 (en) | Sensor data damping |
| US20130077820A1 (en) | Machine learning gesture detection |
| US9557836B2 (en) | Depth image compression |
| KR102359136B1 (en) | Gesture recognition method and gesture recognition device performing the same |
| Masood et al. | Measuring and reducing observational latency when recognizing actions |
| US20150199592A1 (en) | Contour-based classification of objects |
| US20200150765A1 (en) | Systems and methods for generating haptic effects based on visual characteristics |
| US20230007167A1 (en) | Image processing device and image processing system, and image processing method |
| HK1174707A (en) | Gesture detection and recognition |
| HK1174707B (en) | Gesture detection and recognition |
| HK1261414A1 (en) | Work analysis assistance device, work analysis assistance method, computer program, and information storage medium |
| Franssen | Gesture recognition in streaming motion data using offline training with a limited training set |
| HK1184579B (en) | Depth image compression |