Detailed Description
      Embodiments will now be described, by way of example only, with reference to the accompanying drawings. Like reference numbers and characters in the drawings indicate like elements or equivalents.
      Some portions of the following description are presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of steps leading to a desired result. These steps are those requiring physical manipulations of physical quantities such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
      Unless specifically stated otherwise, and as will be apparent from the following, it is appreciated that throughout the present document, discussions utilizing terms such as "receiving," "scanning," "computing," "determining," "replacing," "generating," "initializing," "outputting," or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
      Also disclosed herein are apparatuses for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of a more specialized apparatus for carrying out the required method steps may be appropriate. The structure of a computer adapted to perform the various methods/processes described herein will appear from the description below.
      Further, a computer program is implicitly disclosed herein, since it is clear to a person skilled in the art that the individual steps of the methods described herein can be implemented by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and code therefor may be used to implement the teachings of the disclosure as contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variations of computer programs that may use different control flows without departing from the spirit or scope of the present invention.
      Furthermore, one or more steps of a computer program may be executed in parallel rather than sequentially. Such a computer program may be stored on any computer-readable medium. The computer-readable medium may include a storage device such as a magnetic or optical disk, a memory chip, or another storage device suitable for interfacing with a computer. The computer-readable medium may also include a hard-wired medium, such as that exemplified in the Internet system, or a wireless medium, such as that exemplified in the GSM mobile telephone system, as well as other wireless systems such as Bluetooth, ZigBee, and Wi-Fi. The computer program, when loaded and executed on such a computer, effectively results in an apparatus that implements the steps of the preferred method.
      "Electronic know your customer" (eKYC) is a digital due diligence process performed by business entities or service providers to verify the identity of their customers and thereby prevent identity fraud. Authentication may be considered a form of fraud detection in which the user's legitimacy is verified and a potential fraudster may be detected before fraud is carried out. Effective authentication can enhance the data security of the system, thereby protecting the digital data from unauthorized users.
      The techniques described herein produce one or more technical effects. In particular, by calculating a similarity/difference between a first eye pupil image of a user and a second eye pupil image of the user (wherein the second eye pupil image of the user is photographed when white light is projected onto the face of the user), the user authentication method and system may reduce an attack success rate on the eKYC process, and may be particularly effective for recognizing an attack using a 3D mask. If the first eye pupil image of the user and the second eye pupil image of the user are determined to be similar or identical, an attack on the eKYC process may be recognized.
      Further, the user authentication methods and systems may provide greater accuracy in detecting an attack by calculating a confidence score based on extracted supplemental features associated with a facial image of a user (where one or more facial images of the user were taken while projecting one or more non-white lights onto the user's face).
      Embodiments seek to provide a user authentication method involving a flash-based face anti-spoofing method to detect photographs, screenshots, and 2D and 3D masks used to spoof a face. In the case of a live face, when a flash is projected onto the pupil region, white spots are expected to be seen in and/or around the pupil region. However, in the case of a photograph, screenshot, or 2D or 3D mask spoofing a human face, it is expected that white spots will not be seen in and/or around the pupil region.
      An image capturing device (e.g., an RGB camera) is used to capture an image of a user's face without any flash being projected onto the user's face. The RGB camera is equipped with a CMOS sensor through which an image of the user's face (or any image presented by an attacker for authentication, such as a photograph, high-resolution screenshot, 2D/3D printed mask, etc.) can be acquired. The eye region of the captured face image is cropped so as to focus on the eye pupil region. The cropped pupil region image (labeled "A") is collected. In image A, no flash is projected onto the user's face/eyes.
      The eye region can be cropped by first extracting facial landmark points, including eye points. Yue Wu et al., "Facial Landmark Detection: A Literature Survey," published on 15 May 2018, discloses techniques for extracting facial landmark points. The eye region/pupil region is then cropped based on the eye points. For example, a rectangular area containing one eye may be cropped by obtaining the leftmost, rightmost, uppermost, and lowermost eye points. The eye pupil region may be cropped in a similar manner.
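      The rectangular crop described above can be sketched as follows. This is a minimal illustration only: the names `eye_bounding_box` and `crop`, the optional `margin` parameter, and the representation of an image as a list of pixel rows are assumptions made for the example and are not part of the disclosure. The eye points are assumed to have already been produced by some landmark detector.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]          # (x, y) pixel coordinates; y grows downwards
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), all inclusive


def eye_bounding_box(eye_points: Sequence[Point], margin: int = 0) -> Box:
    """Rectangle spanned by the leftmost, uppermost, rightmost and lowermost eye points."""
    if not eye_points:
        raise ValueError("at least one eye point is required")
    xs = [x for x, _ in eye_points]
    ys = [y for _, y in eye_points]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the boxed region out of an image stored as a list of pixel rows."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    # Clamp the box so that a margin cannot push it outside the image.
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, width - 1), min(bottom, height - 1)
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

      The same two helpers could be applied a second time with pupil points instead of eye points to obtain the pupil region.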
      The image capturing device is then used to capture a face image of the user while a flash is projected onto the user's face. The eye region of the captured face image is cropped so as to focus on the eye pupil region. The corresponding eye pupil region image (labeled "B") is collected.
      White spots are expected to be clearly seen in the pupil region of image B (captured with the flash). On the other hand, white spots are not expected to be seen in the pupil region of image A (captured without the flash).
      For each of the pupil region images A and B, a convolutional neural network (CNN) N1 is used to extract features. CNN N1 may be trained using a large set of pupil region images, both with and without a flash projected onto the eyes. In one embodiment, the feature extractor is trained using resnet18 as the network structure.
      The similarity score S1 is calculated based on the features extracted from the eye pupil region images A and B. In one embodiment, a cosine similarity method is used to calculate the similarity of the two feature vectors. Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between the two vectors and determines whether the two vectors point in approximately the same direction. The formula that can be used to calculate the cosine similarity is as follows.

      $$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

      where A_i and B_i are the components of vectors A and B, respectively.
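      As an illustration, the cosine similarity of two feature vectors can be computed as in the sketch below. The function name `cosine_similarity` and the use of plain Python sequences for the feature vectors are assumptions made for the example; in practice the vectors would be the features extracted by CNN N1 from images A and B.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

      The result lies between -1 and 1, with values near 1 indicating that the two vectors point in approximately the same direction.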
      If the similarity score S1 is greater than the predetermined threshold (T1), the two images A and B are similar. In other words, the result indicates that image B is a spoofed image (e.g., image B is an image of a 3D mask or a screenshot) because no white spots are observed in the eye pupil region. On the other hand, if the score S1 is smaller than the predetermined threshold (T1), the two images A and B are different. In other words, a white spot is expected to be observed in the eye pupil region of image B.
      In an embodiment, to make fraud detection more robust and accurate, additional steps are performed that include projecting a series of colored lights onto the user's face to obtain multiple frames with different colors (e.g., without limitation, red, blue, yellow, green). The colored light is projected onto the entire face, and the multiple frames with different colors show the entire face rather than only the eye pupil regions (as in eye pupil region images A and B).
      For the multiple frames with different colors, features are extracted using a CNN (N2). CNN N2 may be trained using a large-scale set of multiple frames with different colors, captured from live and spoofed faces. In one embodiment, resnet18 is employed as the network structure to train binary classifier N2.
      The confidence score S2 is calculated based on features extracted from multiple frames having different colors.
      The similarity score S1 and the confidence score S2 may be fused to obtain a final authentication result. The decision function may be defined as: if S1 < T1 or S2 > T2, the result is a spoofed face; otherwise, the result is a live face. The predetermined thresholds T1 and T2 may be calculated based on the validation data set.
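      The decision function can be written down directly, as in the sketch below. The comparison directions are copied verbatim from the rule stated in the preceding paragraph; the function name `fuse_scores`, the string return values, and any threshold values passed in are hypothetical and serve only to illustrate the fusion step.

```python
def fuse_scores(s1: float, s2: float, t1: float, t2: float) -> str:
    """Fuse similarity score S1 and confidence score S2 into one decision.

    Transcribes the stated rule: spoofed if S1 < T1 or S2 > T2, otherwise live.
    """
    if s1 < t1 or s2 > t2:
        return "spoof"
    return "live"
```

      Because the two conditions are joined by "or", either stage alone is sufficient to reject the presented face.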
      In summary, the authentication method can be divided into two stages. The first "flash" stage involves obtaining a similarity score S1 based on features extracted from the eye pupil region image A (no flash projected onto the user's eye pupil region) and the eye pupil region image B (flash projected onto the user's eye pupil region). The second "color sequence" stage involves obtaining a confidence score S2 based on features extracted from multiple frames having different colors projected onto the user's face.
      The color sequence phase can effectively detect most spoofing attacks, such as high resolution photos/screenshots and 2D paper masks. However, for a 3D mask, the color sequence phase may not work properly because the surface or material of the 3D mask may be very similar to a human face. During the "flash" phase when a flash (very bright white light) is projected onto the human eye, especially the pupil area, a significant difference can be observed from the two frames (i.e. with/without flash), but the difference may not be found if spoofing is done using a 3D mask. This is because the pupil area of the human eye and the material of the 3D mask are different. To make fraud detection more robust and cover more types of attacks, the similarity score S1 and the confidence score S2 are fused together to make the final decision.
      Fig. 1 is a flowchart 100 illustrating an example of a user authentication method according to an embodiment. At step 102, a trained Convolutional Neural Network (CNN) extracts one or more first features associated with a first pupil image of a user. At step 104, the same trained CNN extracts one or more second features associated with a second eye pupil image of the user. The second eye pupil image of the user is taken while white light (flash light) is projected to the face of the user. On the other hand, the first pupil image of the user is photographed without white light (flash) projected to the face of the user.
      The CNN may be trained using: a large data set comprising an eye pupil image of an eye pupil upon which white light is projected, and a large data set comprising an eye pupil image of an eye pupil upon which white light is not projected. CNN may employ a resnet18 network architecture.
      The light source for the white flash may be a built-in flash of a camera-integrated smartphone. The camera of the smartphone may be used to capture a first eye pupil image of the user (i.e., when the built-in flash is not activated) and a second eye pupil image of the user (i.e., when the built-in flash is activated).
      In particular, the method may comprise the step of capturing a first face image of the user, comprising the first eye pupil image, using an image sensor (of the camera of the smartphone). An eye detection method is then applied to the first face image of the user to generate a first eye detection frame. The region of the first face image of the user within the first eye detection frame corresponds to the first eye pupil image. Similarly, the method may include the step of capturing a second face image of the user, comprising the second eye pupil image, using the image sensor. The eye detection method is then applied to the second face image of the user to generate a second eye detection frame. The region of the second face image of the user within the second eye detection frame corresponds to the second eye pupil image.
      At step 106, a similarity score is generated based on the extracted first feature and the extracted second feature.
      At step 108, the user is authenticated if the similarity score indicates a difference between the extracted first feature and the extracted second feature.
      In the case of a photograph, screenshot, 2D or 3D mask spoofing a human face, when a flash of light is projected onto the eye pupil region, white dots are not visible in and/or around the eye pupil region.
      On the other hand, in the case of a live face, when the flash is projected onto the eye pupil region, white spots can be seen in and/or around the eye pupil region. Fig. 2a shows an example of an eye pupil region image 202 in the case where no flash is projected onto the eye. For comparison, Figs. 2b, 2c, and 2d show examples of eye pupil region images (of a live face) in the case where a flash is projected onto the eyes. In Fig. 2b, a white spot 204 can be seen in the eye pupil region. In Fig. 2c, two white spots 206 can be seen in the eye pupil region. In Fig. 2d, white spots 208 can be seen around the eye pupil region. Depending on the number of flashes and the angle of the flashes with respect to the eye pupil, white spots may appear at different locations in and/or around the eye pupil region.
      The presence of white spots in and/or around the pupil region of a live face when a flash is projected onto the eyes is due to the flash light reflecting off the optic nerve. This occurs when the flash light enters the eyes at an angle, resulting in a white eye effect. This is also referred to as "disc reflection" or "white reflection".
      In addition to the white eye effect, one or more red spots may appear in and/or around the pupil region of a live face when a flash is projected onto the eyes. This is known as the red-eye effect: the flash occurs too quickly for the pupil to close, so much of the flash light enters the eye through the pupil, reflects off the back of the eye (where there is a large supply of blood), and exits through the pupil. The camera records the reflected light.
      The white eye effect or the red eye effect causes a difference between the first eye pupil image of the user and the second eye pupil image of the user. Therefore, there is a difference between the extracted first feature and the extracted second feature. If the similarity score (determined at step 106) indicates a difference between the extracted first feature and the extracted second feature, the user may be authenticated.
      In an embodiment, in order to make fraud detection more robust and accurate, the method may further comprise the following steps. First, a trained supplemental Convolutional Neural Network (CNN) is used to extract one or more supplemental features associated with a supplemental facial image of the user. The supplemental facial image of the user is taken while non-white light is projected onto the user's face. A confidence score is then generated based on the extracted supplemental features.
      In one embodiment, the confidence score is predicted using the softmax function of the trained supplemental CNN. The softmax function is used as the last layer of a neural network based classifier and maps the non-normalized output of the network to a probability distribution over the predicted output classes. The supplemental CNN is trained on a large-scale sequence of facial images, both live and spoofed. During the authentication phase, a confidence score is generated by the trained supplemental CNN with the softmax function, based on a face image taken while non-white light is projected onto the user's face.
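      The mapping performed by the softmax function can be illustrated with the short sketch below. It is not the trained network itself: the raw outputs (logits) are supplied by the caller, and the function names as well as the choice of which class index represents the reported confidence are assumptions made for the example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Map non-normalized network outputs to a probability distribution."""
    if not logits:
        raise ValueError("at least one logit is required")
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_score(logits: Sequence[float], class_index: int = 1) -> float:
    """Probability assigned to one class (the index used here is hypothetical)."""
    return softmax(logits)[class_index]
```

      For a binary live/spoof classifier, the two probabilities sum to one, so a single class probability is enough to serve as the confidence score.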
      To make fraud detection more robust and accurate, the user is authenticated in the event that (i) the similarity score indicates a discrepancy between the extracted first feature and the extracted second feature, or (ii) the confidence score is greater than a predetermined confidence threshold.
      The supplemental CNN is different from the CNN described in step 102 above. The supplemental CNN may be trained using a large data set of facial images, including live facial images and spoofed facial images captured while non-white light is projected. The supplemental CNN may employ a resnet18 network structure.
      To make fraud detection more robust and accurate, multiple supplemental facial images of the user may be taken while multiple non-white lights are projected onto the user's face in sequence (e.g., blue, then red, then yellow, etc.). The light source for the non-white light may be the display screen of a smartphone. The smartphone display may be configured to display a blue screen first, then a red screen, then a yellow screen, etc. An image sensor (of the camera of the smartphone) may be used to capture an image of the user's face while each color is displayed. In other words, if three different non-white lights are displayed and thus projected onto the user's face, three different facial images of the user are captured.
      The method also includes extracting a plurality of supplemental features associated with a plurality of supplemental facial images of the user using the trained supplemental CNN. Thereafter, a confidence score is generated based on the extracted plurality of supplemental features.
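      The color sequence capture and the generation of a single confidence score can be sketched as follows. The callables `show_color`, `capture_frame`, and `score_frame` are hypothetical stand-ins for the display, the image sensor, and the trained supplemental CNN. The description does not specify how the features of several frames are combined, so averaging the per-frame scores is purely an assumption of this sketch.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def capture_color_sequence(
    colors: Sequence[str],
    show_color: Callable[[str], None],
    capture_frame: Callable[[], Frame],
) -> List[Frame]:
    """Display each color in turn and capture one face image per color."""
    frames: List[Frame] = []
    for color in colors:
        show_color(color)               # e.g. fill the display with this color
        frames.append(capture_frame())  # image taken while the color is shown
    return frames


def mean_confidence(frames: Sequence[Frame], score_frame: Callable[[Frame], float]) -> float:
    """Combine per-frame scores by averaging (an assumption, see the lead-in)."""
    if not frames:
        raise ValueError("at least one frame is required")
    scores = [score_frame(frame) for frame in frames]
    return sum(scores) / len(scores)
```

      With three colors, for example blue, red, and yellow, the capture helper returns three frames, matching the one-image-per-color behavior described above.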
      A similarity score greater than a predetermined similarity threshold indicates a difference between the extracted first feature and the extracted second feature. The similarity threshold and the confidence threshold may each be determined based on a validation data set. The validation data set includes live face images and spoofed face images, and a Receiver Operating Characteristic (ROC) curve may be calculated from the validation data and the predicted labels. In one embodiment, each threshold is set to the value on the ROC curve at which the false acceptance rate (FAR) is equal to 0.01 or 0.001.
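      One way of picking a threshold at a target FAR from validation scores is sketched below. It assumes that a presented face is accepted when its score is strictly greater than the threshold, and that `spoof_scores` holds the scores of the spoofed validation images; both the function names and this acceptance convention are assumptions made for the example rather than details given in the description.

```python
from typing import Sequence


def false_acceptance_rate(spoof_scores: Sequence[float], threshold: float) -> float:
    """Fraction of spoofed samples whose score exceeds the threshold."""
    if not spoof_scores:
        raise ValueError("at least one spoof score is required")
    accepted = sum(1 for score in spoof_scores if score > threshold)
    return accepted / len(spoof_scores)


def threshold_at_far(spoof_scores: Sequence[float], target_far: float) -> float:
    """Lowest observed score usable as a threshold without exceeding target_far."""
    if not spoof_scores:
        raise ValueError("at least one spoof score is required")
    if not 0.0 <= target_far <= 1.0:
        raise ValueError("target_far must lie between 0 and 1")
    ranked = sorted(spoof_scores, reverse=True)
    # Number of spoofed samples that may be accepted at the target rate.
    allowed = min(int(target_far * len(ranked)), len(ranked) - 1)
    # Only scores strictly above ranked[allowed] are accepted: at most `allowed`.
    return ranked[allowed]
```

      A target of 0.01 or 0.001 only becomes meaningful with a validation set containing at least hundreds or thousands of spoofed samples, since otherwise no spoofed sample may be accepted at all.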
      Fig. 3 shows a schematic diagram of a computer system suitable for performing at least some of the steps of a user authentication method.
      The following description of computing system/computing device 300 is provided by way of example only and is not intended to be limiting.
      As shown in fig. 3, the exemplary computing device 300 includes a processor 304 for executing software routines. Although a single processor is shown for clarity, computing device 300 may also include a multi-processor system. The processor 304 is connected to a communication infrastructure 306 to communicate with other components of the computing device 300. The communication infrastructure 306 may include, for example, a communication bus, a crossbar, or a network.
       Computing device 300 also includes a main memory 308, such as Random Access Memory (RAM), and a secondary memory 310. The secondary memory 310 may include, for example, a hard disk drive 312 and/or a removable storage drive 314, where the removable storage drive 314 may include a magnetic tape drive, an optical disk drive, etc. The removable storage drive 314 reads from and/or writes to a removable storage unit 318 in a well known manner. Removable storage unit 318 may comprise a magnetic tape, an optical disk, etc. which is read by and written to by removable storage drive 314. As will be appreciated by one skilled in the relevant art, removable storage unit 318 includes a computer-readable storage medium having stored therein computer-executable program code instructions and/or data.
      In alternative embodiments, secondary memory 310 may additionally or alternatively include other similar means for allowing computer programs or other instructions to be loaded into computing device 300. Such means may include, for example, a removable storage unit 322 and an interface 320. Examples of removable storage unit 322 and interface 320 include a removable storage chip (e.g., an EPROM or PROM) and associated socket, and other removable storage units 322 and interfaces 320 that allow software and data to be transferred from removable storage unit 322 to computing device 300.
       Computing device 300 also includes at least one communication interface 324. Communications interface 324 allows software and data to be transferred between computing device 300 and external devices via communications path 326. In various embodiments, communication interface 324 allows data to be transferred between computing device 300 and a data communication network, such as a public or private data communication network. The communication interface 324 may be used to exchange data between different computing devices 300, which computing devices 300 form part of an interconnected computer network. Examples of communication interface 324 may include a modem, a network interface (such as an ethernet card), a communication port, an antenna with associated circuitry, and the like. The communication interface 324 may be wired or may be wireless. Software and data transferred via communications interface 324 are in the form of signals which may be electrical, electromagnetic, optical or other signals capable of being received by communications interface 324. These signals are provided to the communications interface via communications path 326.
      Optionally, the computing device 300 further comprises: a display interface 302 that performs operations for presenting images to an associated display 330; and an audio interface 332 that performs operations for playing audio content via an associated speaker 334.
      As used herein, the term "computer program product" may refer, in part, to removable storage unit 318, removable storage unit 322, a hard disk installed in hard disk drive 312, or a carrier wave carrying software to communication interface 324 over communication path 326 (a wireless link or cable). Computer-readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to computing device 300 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tapes, CD-ROMs, DVDs, Blu-ray™ discs, hard disk drives, ROMs or integrated circuits, USB memory, magneto-optical disks, or computer-readable cards such as PCMCIA cards, whether such devices are internal or external to computing device 300. Examples of transitory or non-tangible computer-readable transmission media that may also participate in providing software, applications, instructions, and/or data to computing device 300 include radio or infrared transmission channels and a network connection to another computer or networked device, as well as the Internet or an intranet, including email transmissions and information recorded on websites and the like.
      Computer programs (also called computer program code) are stored in main memory 308 and/or secondary memory 310. Computer programs may also be received via communication interface 324. Such computer programs, when executed, enable computing device 300 to perform one or more features of the embodiments discussed herein. In various embodiments, the computer programs, when executed, enable the processor 304 to perform the features of the embodiments described above. Accordingly, such computer programs represent controllers of the computing device 300.
      The software may be stored in a computer program product and loaded into computing device 300 using removable storage drive 314, hard disk drive 312, or interface 320. Alternatively, the computer program product may be downloaded to computing device 300 over communication path 326. The software, when executed by the processor 304, causes the computing device 300 to perform the functions of the embodiments described herein.
      It should be understood that the embodiment of fig. 3 is given by way of example only. Thus, in some embodiments, one or more features of computing device 300 may be omitted. Also, in some embodiments, one or more features of computing device 300 may be combined together. Additionally, in some embodiments, one or more features of computing device 300 may be separated into one or more components.
      Fig. 4 is a schematic diagram illustrating an example of a user authentication system 400 according to an embodiment. The user authentication system includes an extraction device 402, a score generation device 404, and an authentication device 406. The extraction device 402 extracts one or more first features associated with a first pupil image of a user using a trained Convolutional Neural Network (CNN). The extraction device 402 also extracts one or more second features associated with a second eye pupil image of the user using the trained CNN. The second eye pupil image of the user is taken when the white light is projected onto the face of the user.
      The score generation device 404 generates a similarity score based on the extracted first feature and the extracted second feature. The authentication device 406 authenticates the user if the similarity score indicates a difference between the extracted first feature and the extracted second feature.
      The extraction device 402 may also extract one or more supplemental features associated with a supplemental facial image of the user using a trained supplemental Convolutional Neural Network (CNN). The supplemental facial image of the user is taken while projecting non-white light onto the user's face. The score generation device 404 may generate a confidence score based on the extracted supplemental features. The authentication device 406 authenticates the user if: (i) the similarity score indicates a difference between the extracted first feature and the extracted second feature, or (ii) the confidence score is greater than a predetermined confidence threshold.
      The system 400 may also include an image sensor 408 to capture a first face image (containing the first eye pupil image) of the user. An eye detection method is applied to the first face image of the user to generate a first eye detection frame. The region of the first face image of the user within the first eye detection frame corresponds to the first eye pupil image. The image sensor 408 also captures a second face image (containing the second eye pupil image) of the user, and the eye detection method is applied to the second face image of the user to generate a second eye detection frame. The region of the second face image of the user within the second eye detection frame corresponds to the second eye pupil image.
      The CNN may be trained using a large data set of eye pupil images that include an eye pupil upon which white light is projected. The supplemental CNN may be trained using a large data set comprising live face images and spoofed face images captured while non-white light is projected. Both the CNN and the supplemental CNN may employ a resnet18 network architecture.
      Multiple supplemental facial images of the user may be taken while multiple non-white lights are sequentially projected onto the user's face. The extraction device 402 may use the trained supplemental CNN to extract a plurality of supplemental features associated with the plurality of supplemental facial images of the user, and the score generation device 404 may generate a confidence score based on the plurality of extracted supplemental features.
      A similarity score greater than a predetermined similarity threshold indicates a difference between the extracted first feature and the extracted second feature. The similarity threshold and the confidence threshold are determined separately based on the validation dataset.
      The term "configured" is used herein in connection with systems, devices, and computer program components. For a system of one or more computers configured to perform a particular operation or action, it is meant that the system has installed thereon software, firmware, hardware, or a combination thereof that in operation causes the system to perform the operation or action. For one or more computer programs configured to perform specific operations or actions, it is meant that the one or more programs include instructions, which when executed by a data processing apparatus, cause the apparatus to perform the operations or actions. By dedicated logic circuitry configured to perform a particular operation or action is meant that the circuitry has electronic logic to perform the operation or action.
      It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments herein without departing from the spirit or scope of the invention as broadly described. The described embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.