CN113033551B - Method, device, equipment and storage medium for detecting object - Google Patents

Info

Publication number: CN113033551B (application CN202110281522.3A)
Authority: CN (China)
Other versions: CN113033551A (in Chinese)
Inventors: 徐一涛, 刘宁, 车正平, 唐剑
Assignee: Beijing Didi Infinity Technology and Development Co Ltd
Legal status: Active (granted)

Classifications

    • G06V 10/255 — Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06T 7/20 — Image analysis; analysis of motion
    • G06T 2207/10016 — Video; image sequence
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06V 2201/07 — Target detection

Abstract

According to embodiments of the present disclosure, a method, apparatus, device, and storage medium for object detection are provided. The method proposed herein comprises: determining a first set of candidate boxes in a first image captured by an image capture device, the first set of candidate boxes indicating objects that may be included in the first image; determining a target region based on the first set of candidate boxes; determining a sub-image corresponding to the target region from a second image subsequently captured by the image capturing device; and detecting at least one object from the sub-image. According to embodiments of the present disclosure, object detection can be performed using sub-images of smaller size in the subsequent object detection process, thereby improving the efficiency of object detection.

Description

Method, device, equipment and storage medium for detecting object
Technical Field
Implementations of the present disclosure relate to the field of computers, and more particularly, to methods, apparatuses, and computer storage media for object detection.
Background
At present, with the development of computer vision technology, object detection technology is widely applied in fields such as security, surveillance, and intelligent transportation. To improve real-time performance, some schemes use embedded edge computing devices to perform object detection, such as detecting, counting, or tracking pedestrians in a video frame. However, since the computing power of embedded edge computing devices is relatively limited, how to improve the efficiency of object detection has become a focus of attention.
Disclosure of Invention
Embodiments of the present disclosure provide a solution for object detection.
In a first aspect of the present disclosure, a method for object detection is provided. The method comprises the following steps: determining a first set of candidate boxes in a first image captured by an image capture device, the first set of candidate boxes indicating objects that may be included in the first image; determining a target region based on the first set of candidate boxes; determining a sub-image corresponding to the target region from a second image subsequently captured by the image capturing device; and detecting at least one object from the sub-image.
In a second aspect of the present disclosure, an apparatus for object detection is provided. The device comprises: a first candidate box determination module configured to determine a first set of candidate boxes in a first image captured by the image capture device, the first set of candidate boxes indicating objects that may be included in the first image; a target region determination module configured to determine a target region based on the first set of candidate boxes; a sub-image determining module configured to determine a sub-image corresponding to the target area from a second image subsequently captured by the image capturing device; and an object detection module configured to detect at least one object from the sub-image.
In a third aspect of the present disclosure, there is provided an electronic device comprising: a memory and a processor; wherein the memory is for storing one or more computer instructions, wherein the one or more computer instructions are executable by the processor to implement a method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement a method according to the first aspect of the present disclosure.
According to various embodiments of the present disclosure, object detection may be performed using sub-images having smaller sizes in a subsequent object detection process, thereby improving efficiency of object detection. Such an approach can reduce the need for computing power of the device, with better versatility.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a schematic diagram of object detection according to some embodiments of the present disclosure;
FIG. 3 illustrates a flowchart of an example process of object detection, according to some embodiments of the present disclosure;
FIG. 4 illustrates a schematic block diagram of an apparatus for object detection according to some embodiments of the present disclosure; and
FIG. 5 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
In the field of computer vision, conventional object detection methods applied to embedded edge computing devices mainly fall into the following categories:
(1) Deep-learning-based frame-by-frame detection. The limited computing power of an embedded edge computing device cannot support real-time detection with a high-complexity model, so this approach struggles to balance real-time performance against detection accuracy.
(2) Simple cropping-based methods. Conventional cropping-based methods exploit the temporal continuity of video input to improve detection accuracy. However, because the regions where objects (e.g., pedestrians) appear in an image are highly random, multiple object-containing regions may arise, and all of them must then be detected; in that case, detection can become more time-consuming than processing the whole frame with a large model.
(3) Object-tracking-based methods. The general idea is to run an object detection algorithm on a certain frame of the video to obtain pedestrian detection boxes, feed all of the detection boxes into an object tracking algorithm, and let the tracking algorithm produce pedestrian detection results for subsequent video frames, frame by frame. However, such methods have low detection accuracy.
(4) Image-differencing-based methods. Conventional image-differencing methods rely on prior assumptions about object pixel values, so they are easily affected by illumination, weather, and similar factors; missed detections become more frequent under strong illumination at noon or weak illumination in the evening. In addition, computing inter-frame differences is sensitive to the video resolution, and takes a long time at high resolutions.
Therefore, how to improve the accuracy of object detection while ensuring detection instantaneity has become the focus of current attention.
According to an embodiment of the present disclosure, a solution for object detection is provided. In an embodiment of the present disclosure, first, a first set of candidate boxes in a first image captured by an image capture device is determined, wherein the first set of candidate boxes indicates objects that may be included in the first image. Then, a target region is determined based on the first set of candidate boxes, and a sub-image corresponding to the target region is determined from a second image subsequently captured by the image capturing device. Accordingly, at least one object is detected from the sub-image. According to embodiments of the present disclosure, object detection can be performed using sub-images of smaller size in the subsequent object detection process, thereby improving the efficiency of object detection.
Example Environment
Referring first to FIG. 1, there is schematically shown an environment 100 in which exemplary implementations of the present disclosure can be realized. As shown in fig. 1, the environment 100 may include an image capture device 120 configured to capture one or more images 110 of a target area.
In some implementations, the image capture device 120 may be, for example, an overhead camera, for example, may be mounted to the pylon 130 for security monitoring purposes. It should be appreciated that image capture device 120 may also be mounted in other suitable manners.
As shown in fig. 1, image capture device 120 may also be coupled to computing device 140. In some implementations, computing device 140 may include, for example, an embedded edge computing device configured to acquire image 110 captured by image capture device 120 and perform an object detection method to detect one or more objects 150 from image 110.
In some implementations, the object 150 may include, for example, a pedestrian, an animal, a vehicle, and so forth, as appropriate. Computing device 140 may determine a corresponding location of the object in image 110. For example, computing device 140 may generate one or more candidate boxes to indicate regions that may include an object.
In some implementations, the object detection information (e.g., candidate boxes) determined by the computing device 140 may be further used in connection with counting of objects, identification of target objects, or tracking of objects. The detailed process regarding object detection will be described in detail below with reference to fig. 2 to 3.
Example principle
Fig. 2 shows a schematic diagram 200 of object detection according to an embodiment of the present disclosure. As shown in fig. 2, a first recognition model 215 and a second recognition model 235 may be deployed in the computing device 140. In some implementations, the first recognition model 215 and the second recognition model 235 may include, for example, appropriate machine learning models for detecting objects from input images.
In some implementations, the first recognition model 215 and the second recognition model 235 may have different capabilities. For example, the model complexity of the first recognition model 215 may be higher than that of the second recognition model 235. Taking a deep neural network as an example, the first recognition model 215 may have more complex input features, more layers, etc., so that the first recognition model 215 can process higher-resolution images and output more accurate detection results. It should be appreciated that, given the fixed computing power of the computing device 140, the first recognition model 215 may take longer to run than the second recognition model 235.
As shown in fig. 2, the computing device 140 may utilize the first recognition model 215 to process a first image 210-1 captured by the image capture apparatus 120 at a first time to determine a first set of candidate boxes 220 in the first image 210-1. The first set of candidate boxes 220 may, for example, indicate areas in the first image 210-1 that may include objects.
In some implementations, the confidence level of each candidate box in the first set of candidate boxes 220 determined by the first recognition model 215 is above a predetermined first confidence threshold. For example, the first recognition model 215 may determine, as the first set of candidate boxes 220, candidate boxes having a confidence level higher than 0.15 among all candidate boxes.
In some implementations, the computing device 140 may also pre-process the first image 210-1 before the first image 210-1 is provided to the first recognition model 215. In particular, the computing device 140 may adjust the first image 210-1 to adapt the first recognition model 215. Illustratively, the computing device 140 may crop or scale the first image 210-1 to adapt the first recognition model 215.
For example, the first recognition model 215 may be an EfficientDet-D2 model, whose feature extraction portion downsamples the input image to 1/128 of the original size. Therefore, to improve the accuracy of the model processing, the computing device 140 may process the first image 210-1 so that each dimension of its resolution is a multiple of 128. For example, if the resolution of the first image 210-1 is 1920 x 1080, the computing device 140 may first scale the first image 210-1 down to 1024 x 576, then crop 32 pixels from each of the top and bottom of the scaled image, changing the image resolution to 1024 x 512. In this manner, the computing device 140 may ensure that each dimension of the input image of the first recognition model 215 is a multiple of 128. Moreover, in actual operation, the probability that an object appears at the very top or bottom of the image is small, so the cropping operation does not affect final detection accuracy.
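To make the adjustment step concrete, the following is a minimal sketch (not taken from the patent) of the scale-then-crop preprocessing described above, using OpenCV; the function name, the 1024-pixel target width, and the 128-pixel stride are assumptions for the example.

```python
import cv2
import numpy as np

def adjust_for_model(image: np.ndarray, target_width: int = 1024,
                     stride: int = 128) -> np.ndarray:
    """Scale to target_width keeping aspect ratio, then crop the height
    symmetrically down to the nearest multiple of stride.

    For a 1920 x 1080 input this gives 1024 x 576 after scaling and
    1024 x 512 after cropping 32 pixels from the top and the bottom,
    matching the worked example in the text.
    """
    h, w = image.shape[:2]
    scale = target_width / w
    scaled = cv2.resize(image, (target_width, round(h * scale)))
    scaled_h = scaled.shape[0]
    cropped_h = (scaled_h // stride) * stride      # 576 -> 512
    margin = (scaled_h - cropped_h) // 2           # 32 px top and bottom
    return scaled[margin:margin + cropped_h, :]
```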
It should be appreciated that other suitable models may also be utilized as the first recognition model 215, e.g., EfficientDet-D3/D4, Faster R-CNN, etc. The present disclosure is not intended to be limited to a particular type of first recognition model 215.
Additionally, the computing device 140 may process the adjusted first image using the first recognition model 215 to determine a first set of candidate boxes 220. In some implementations, the computing device 140 may further determine a target region 225 having a predetermined size based on the first set of candidate boxes 220 determined by the first recognition model 215.
In some implementations, rather than directly treating all candidate boxes as cropping regions in the conventional manner, the computing device 140 may determine a single target region 225 having a target size based on the first set of candidate boxes 220.
In some implementations, the computing device 140 may determine a target candidate box in the first set of candidate boxes 220 based on the location of the first set of candidate boxes 220, wherein the number of neighboring candidate boxes for the target candidate box is above a predetermined threshold. The neighboring candidate frame represents a candidate frame having a distance from the target candidate frame less than a predetermined threshold.
In some implementations, the computing device 140 may determine a distance between two candidate boxes based on a distance of a center point of the two candidate boxes. Accordingly, the computing device 140 may determine a distance between every two candidate boxes in the first set of candidate boxes 220 and determine candidate boxes having a distance less than a predetermined threshold as neighboring candidate boxes.
The computing device 140 may thereby determine the number of neighboring candidate boxes for each candidate box in the first set of candidate boxes 220, and determine a candidate box for which this number is greater than a predetermined threshold as the target candidate box. For example, the computing device 140 may determine the candidate box with the most neighboring candidate boxes as the target candidate box. If multiple candidate boxes tie for the largest number of neighbors, the computing device 140 may, for example, randomly select one of them as the target candidate box.
Additionally, the computing device 140 may determine the target region 225 having a predetermined size based on the location of the target candidate box. For example, the computing device 140 may determine the target region 225 of the predetermined size centered on the center point of the target candidate box. For example, if the center point of the target candidate box is (x, y) and the size of the target region 225 is (w, h), the four vertices of the target region may be represented as (x-w/2, y-h/2), (x-w/2, y+h/2), (x+w/2, y-h/2), and (x+w/2, y+h/2), so that the extent of the target region 225 is determined.
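A minimal sketch of this neighbor-counting strategy follows; the function and parameter names are illustrative assumptions, not the patent's implementation. It picks the candidate box with the most neighbors and centers a fixed-size region on it, consistent with the vertex formula above.

```python
import numpy as np

def pick_target_region(boxes, region_size, dist_thresh, count_thresh):
    """boxes: (N, 4) array of (x1, y1, x2, y2) candidate boxes.

    Returns (x1, y1, x2, y2) of a region of size region_size centered on
    the candidate box with the most neighbors, or None if no box has at
    least count_thresh neighbors.
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    # Pairwise distances between candidate-box center points.
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    neighbor_counts = (dists < dist_thresh).sum(axis=1) - 1  # exclude self
    if neighbor_counts.max() < count_thresh:
        return None
    # Ties: argmax picks the first index; the text suggests a random choice.
    cx, cy = centers[np.argmax(neighbor_counts)]
    w, h = region_size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```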
In some implementations, computing device 140 may also determine target region 225, for example, using a method of traversing the image. For example, the computing device 140 may determine the target region 225 having a predetermined size based on the location of the first set of candidate boxes 220 such that the target region covers at least one candidate box of the first set of candidate boxes, wherein a ratio of a first number of candidate boxes of the at least one candidate box to a second number of candidate boxes of the first set of candidate boxes is greater than a predetermined threshold.
For example, the computing device 140 may traverse the image with a region of the predetermined size as a sliding window to determine the number of candidate boxes covered by different regions. Here, covering a candidate box may mean that the center point of the candidate box is located in the region, or that all four vertices of the candidate box are located in the region.
Illustratively, the computing device 140 may determine the target region 225 by way of a sliding window such that the target region 225 is capable of covering more than a predetermined proportion of the candidate boxes in the first set of candidate boxes 220.
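The sliding-window variant might look like the following sketch; the window step, the coverage test, and the 0.8 ratio are illustrative assumptions that the text leaves unspecified.

```python
def region_by_sliding_window(boxes, image_size, region_size,
                             cover_ratio=0.8, step=64):
    """Slide a fixed-size window across the image and return the window
    covering the most candidate boxes, provided it covers at least
    cover_ratio of them; otherwise return None.

    A box counts as covered when its center lies inside the window (one
    of the two coverage notions mentioned in the text).
    """
    img_w, img_h = image_size
    w, h = region_size
    centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]
    best_region, best_count = None, -1
    for y in range(0, max(img_h - h, 0) + 1, step):
        for x in range(0, max(img_w - w, 0) + 1, step):
            count = sum(x <= cx <= x + w and y <= cy <= y + h
                        for cx, cy in centers)
            if count > best_count:
                best_region, best_count = (x, y, x + w, y + h), count
    if boxes and best_count >= cover_ratio * len(boxes):
        return best_region
    return None
```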
In this way, a target area having a size smaller than the original image can be determined. It should be appreciated that in general, the areas in the image where objects may appear are relatively concentrated.
In some implementations, the distance between the image capture device 120 and the objects it can capture is greater than a predetermined threshold. For example, the image capturing device 120 may be an overhead camera for detecting pedestrians, which makes the regions where pedestrians may appear in the captured images relatively concentrated.
As shown in fig. 2, computing device 140 may acquire a second image 210-2 captured by image capture apparatus 120 at a second time that is later than the first time. In some implementations, the first image 210-1 and the second image 210-2 are adjacent image frames in video captured by the image capture device 120.
For the second image 210-2, the computing device 140 may first determine a sub-image 230 corresponding to the target region 225 and provide the sub-image 230 to the second recognition model 235 for use in detecting the first object in the sub-image 230.
In some implementations, the second recognition model 235 may determine, for example, a second set of candidate boxes 240 in the sub-image 230 associated with the first object. The second recognition model 235 may be, for example, an EfficientDet-D0 detection model, trained, for example, to process images of size 256 x 256. It should be appreciated that the second recognition model 235 may also employ other suitable models, such as YOLOv3/v4 models, and the like. The present disclosure is not intended to be limited to a particular type of second recognition model 235.
In some implementations, the confidence of each candidate box in the second set of candidate boxes 240 determined by the second recognition model 235 may be above a second confidence threshold. To increase the processing speed of the second recognition model 235, the second confidence threshold may be set higher than the first confidence threshold associated with the first recognition model 215, for example, in view of there being less interference in the sub-images.
Since the second recognition model 235 has lower model complexity, it can be executed efficiently by the computing device 140. Moreover, since the image size processed by the second recognition model 235 is relatively small, the accuracy of object detection can be improved while real-time performance is maintained. On the other hand, since the distribution of objects in the image is relatively concentrated, the determined sub-image can cover most of the objects, which satisfies the application requirements of most scenarios.
In some implementations, the computing device 140 may utilize matches between the first set of candidate boxes 220 and the second set of candidate boxes 240 for object tracking. Illustratively, the computing device 140 may determine tracking information for the first object at the second time based on the match between the first set of candidate boxes 220 and the second set of candidate boxes 240, wherein the tracking information includes at least one of: position, direction of movement, or rate of movement.
For example, the computing device 140 may determine that a first candidate box of the first set of candidate boxes 220 corresponds to the same object as a second candidate box of the second set of candidate boxes 240; accordingly, the computing device 140 may determine the position, direction of movement, rate of movement, and the like of the object at the second time based on the difference in position between the first candidate box and the second candidate box.
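A sketch of how such a match could yield tracking information follows. The matching criterion is not fixed by the text; intersection-over-union overlap is one common choice and is used here as an assumption.

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def tracking_info(box_t1, box_t2, dt):
    """Derive position, direction, and rate of movement for an object whose
    candidate boxes at two times have been matched (e.g., via box_iou)."""
    c1 = np.array([(box_t1[0] + box_t1[2]) / 2, (box_t1[1] + box_t1[3]) / 2])
    c2 = np.array([(box_t2[0] + box_t2[2]) / 2, (box_t2[1] + box_t2[3]) / 2])
    velocity = (c2 - c1) / dt                  # pixels per unit time
    rate = float(np.linalg.norm(velocity))
    direction = velocity / rate if rate > 0 else np.zeros(2)
    return {"position": c2, "direction": direction, "rate": rate}
```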
In some implementations, to ensure a balance between processing efficiency and object detection performance, the computing device 140 may also periodically process the complete input image using the first recognition model 215 and, for the intermediate frames, process the sub-image corresponding to the target region using the second recognition model 235.
As shown in fig. 2, the computing device 140 may continue processing a predetermined number (e.g., one or more) of images (e.g., image frames) using the second recognition model 235. For example, the computing device 140 may process the sub-image 245 corresponding to the target region 225 in the image 210-3 captured by the image capture device 120 using the second recognition model 235, and output a set of candidate boxes 250 in the sub-image 245.
For a third image 210-4 captured by the image capture device 120 at a third time, the computing device 140 may utilize the first recognition model 215 to process according to the procedure discussed with reference to the first image 210-1 for determining a set of candidate boxes 255. In some implementations, the third time is later than the first time by a predetermined time interval.
It should be appreciated that the set of candidate boxes 255 may be further used, for example, to determine a new target region and for a process by which the computing device 140 processes subsequent images using the second recognition model 235.
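Putting the schedule together, a rough skeleton of the alternation between the two models might read as follows; the period, region size, and thresholds are illustrative choices, and adjust_for_model and pick_target_region are the hypothetical helpers sketched earlier.

```python
def run_pipeline(frames, first_model, second_model, period=5):
    """Run the heavy model on every period-th full frame to refresh the
    target region; run the light model on the cropped sub-image for the
    frames in between. Yields (frame_index, boxes)."""
    region = None
    for i, frame in enumerate(frames):
        if i % period == 0 or region is None:
            # first_model stands in for the first recognition model and is
            # assumed to return an (N, 4) array of candidate boxes.
            boxes = first_model(adjust_for_model(frame))
            region = pick_target_region(boxes, region_size=(256, 256),
                                        dist_thresh=80.0, count_thresh=2)
        else:
            x1, y1, x2, y2 = (int(v) for v in region)
            boxes = second_model(frame[y1:y2, x1:x2])  # sub-image only
            # Note: these boxes are in sub-image coordinates; add (x1, y1)
            # to map them back into full-frame coordinates.
        yield i, boxes
```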
In some scenarios, for the first set of candidate boxes 220, the determined target region 225 may fail to cover some of the candidate boxes, such that the second recognition model 235 is also unable to detect the objects corresponding to those candidate boxes.
In some implementations, the computing device 140 may utilize object detection results from earlier processing of a fourth image by the first recognition model 215 to estimate motion information for such objects. For example, the computing device 140 may obtain a third set of candidate boxes determined by the first recognition model based on a fourth image captured by the image capture device at a fourth time instant that is a predetermined time interval earlier than the first time instant.
Additionally, the computing device 140 may determine motion information for a third object at the first time based on a match between the third set of candidate boxes and the first set of candidate boxes, the third object being associated with a predetermined candidate box of the first set of candidate boxes that is outside the target area. For example, the computing device 140 may determine that a third candidate box of the third set of candidate boxes indicates the same third object as a fourth candidate box of the first set of candidate boxes 220, and the computing device 140 may determine motion information, e.g., a direction of motion and a rate of motion, for the third object based on a distance between the third candidate box and the fourth candidate box and based on a time interval between the first time and the fourth time.
Additionally, the computing device 140 may also determine tracking information for the third object at the second time based on the predetermined candidate box and the motion information in the first image 210-1. For example, the computing device 140 may calculate a position, a direction of movement, and a rate of movement of the third object at the second time based on the determined direction of movement and the rate of movement.
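For objects that fall outside the target region, the extrapolation described above amounts to shifting the last known box by the estimated velocity; a sketch under the same assumptions as the earlier snippets:

```python
import numpy as np

def extrapolate_box(box, velocity, dt):
    """Shift an (x1, y1, x2, y2) box by velocity * dt to estimate where the
    object will be at the later time; velocity comes from matching the
    third set of candidate boxes against the first set."""
    dx, dy = np.asarray(velocity, dtype=float) * dt
    return np.asarray(box, dtype=float) + np.array([dx, dy, dx, dy])
```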
In this manner, even though the second recognition model 235 may not detect the third object in the sub-image 230, embodiments of the present disclosure are able to infer tracking information of the third object based on the prior detection results of the first recognition model, thereby guaranteeing the completeness of object detection.
Example procedure
An example process of object detection is described below in connection with fig. 3. FIG. 3 illustrates a flowchart of an example process 300 of object detection, according to some embodiments of the present disclosure. Process 300 may be implemented, for example, by computing device 140 in fig. 1.
As shown in fig. 3, at block 302, the computing device 140 determines a first set of candidate boxes in a first image captured by an image capture apparatus at a first time, the first set of candidate boxes indicating regions in the first image that may include an object.
At block 304, the computing device 140 determines a target region having a predetermined size based on the first set of candidate boxes.
At block 306, the computing device 140 determines a sub-image corresponding to the target region from a second image captured by the image capture apparatus at a second time instant, the second time instant being later than the first time instant.
At block 308, the computing device 140 detects a first object from the sub-image.
In some implementations, detecting the first object includes: a second set of candidate boxes associated with the first object in the sub-image is determined, wherein the first set of candidate boxes is determined by a first recognition model and the second set of candidate boxes is determined by a second recognition model having a lower model complexity than the first recognition model.
In some implementations, determining a first set of candidate boxes in a first image captured by an image capture device includes: adjusting the first image to adapt the first recognition model; and processing the adjusted first image using the first recognition model to determine a first set of candidate boxes.
In some implementations, the first recognition model determines the first set of candidate boxes based on a first confidence threshold, and the second recognition model determines the second set of candidate boxes based on a second confidence threshold, the first confidence threshold being lower than the second confidence threshold.
In some implementations, the process 300 further includes: based on the matches between the first set of candidate boxes and the second set of candidate boxes, tracking information of the first object at the second time is determined, the tracking information including at least one of: position, direction of movement, or rate of movement.
In some implementations, the process 300 further includes: processing a third image captured by the image capturing device at a third time instant with the first recognition model to detect a second object included in the third image, wherein the third time instant is later than the first time instant by a predetermined time interval, and providing a sub-image of at least one image captured by the image capturing device between the first time instant and the second time instant, corresponding to the target area, to the second recognition model for object detection.
In some implementations, the process 300 further includes: acquiring a third set of candidate frames determined by the first recognition model based on a fourth image captured by the image capturing device at a fourth time instant, the fourth time instant being earlier than the first time instant by a predetermined time interval; determining motion information of a third object at the first moment based on a match between the third set of candidate boxes and the first set of candidate boxes, the third object being associated with a predetermined candidate box of the first set of candidate boxes outside the target area; and determining tracking information of the third object at the second moment based on the predetermined candidate frame and the motion information in the first image.
In some implementations, determining the target region includes: determining a target candidate frame in the first group of candidate frames based on the positions of the first group of candidate frames, wherein the number of adjacent candidate frames of the target candidate frame is higher than a preset threshold value, and the distance between the adjacent candidate frame and the target candidate frame is smaller than the preset threshold value; and determining a target area having a predetermined size based on the position of the target candidate frame.
In some implementations, determining the target region includes: a target region having a predetermined size is determined based on the locations of the first set of candidate frames such that the target region covers at least one candidate frame of the first set of candidate frames, wherein a ratio of a first number of candidate frames of the at least one candidate frame to a second number of candidate frames of the first set of candidate frames is greater than a predetermined threshold.
In some implementations, the image capture device is more than a predetermined threshold distance from the capturable object.
In some implementations, the first image and the second image are adjacent video frames captured by the image capture device.
According to the scheme discussed above, embodiments of the present disclosure enable object detection using sub-images of smaller size in the subsequent object detection process, thereby improving the accuracy of object detection while ensuring real-time detection.
Example apparatus and device
Embodiments of the present disclosure also provide corresponding apparatus for implementing the above-described methods or processes. Fig. 4 illustrates a schematic block diagram of an apparatus 400 for object detection according to some embodiments of the present disclosure.
As shown in fig. 4, the apparatus 400 may comprise a first candidate box determination unit 410 configured to determine a first set of candidate boxes in a first image captured by the image capturing apparatus at a first time, the first set of candidate boxes indicating regions in the first image that may include an object.
The apparatus 400 further comprises a target region determination unit 420 configured to determine a target region having a predetermined size based on the first set of candidate boxes.
The apparatus 400 further comprises a sub-image determination unit 430 configured to determine a sub-image corresponding to the target area from a second image captured by the image capturing apparatus at a second time instant, the second time instant being later than the first time instant.
The apparatus 400 further comprises an object detection unit 440 configured to detect the first object from the sub-image.
In some implementations, the object detection unit 440 includes: and a second candidate frame determination unit configured to determine a second set of candidate frames associated with the first object in the sub-image, wherein the first set of candidate frames is determined by a first recognition model and the second set of candidate frames is determined by a second recognition model, the second recognition model having a lower model complexity than the first recognition model.
In some implementations, the first candidate block determination unit 410 includes: an adjustment unit configured to adjust the first image to adapt the first recognition model; and a processing unit configured to process the adjusted first image using the first recognition model to determine a first set of candidate boxes.
In some implementations, the first recognition model determines the first set of candidate boxes based on a first confidence threshold, and the second recognition model determines the second set of candidate boxes based on a second confidence threshold, the first confidence threshold being lower than the second confidence threshold.
In some implementations, the apparatus 400 further includes: a first tracking module configured to determine tracking information of the first object at the second time based on a match between the first set of candidate boxes and the second set of candidate boxes, the tracking information including at least one of: position, direction of movement, or rate of movement.
In some implementations, the apparatus 400 further includes: processing a third image captured by the image capturing device at a third time instant with the first recognition model to detect a second object included in the third image, wherein the third time instant is later than the first time instant by a predetermined time interval, and providing a sub-image of at least one image captured by the image capturing device between the first time instant and the second time instant, corresponding to the target area, to the second recognition model for object detection.
In some implementations, the apparatus 400 further includes: a third candidate box acquisition unit configured to acquire a third set of candidate boxes determined by the first recognition model based on a fourth image captured by the image capturing device at a fourth time instant, the fourth time instant being earlier than the first time instant by a predetermined time interval; a matching unit configured to determine motion information of a third object at the first time based on a match between the third set of candidate boxes and the first set of candidate boxes, the third object being associated with a predetermined candidate box of the first set of candidate boxes outside the target area; and a second tracking unit configured to determine tracking information of the third object at the second time based on the predetermined candidate box and the motion information in the first image.
In some implementations, the target area determination unit 420 includes: a target candidate box determination unit configured to determine a target candidate box in the first set of candidate boxes based on the locations of the first set of candidate boxes, the number of neighboring candidate boxes of the target candidate box being above a predetermined threshold, the distance of a neighboring candidate box from the target candidate box being less than a predetermined threshold; and an expansion unit configured to determine a target area having a predetermined size based on the position of the target candidate box.
In some implementations, the target area determination unit 420 includes: and a clipping unit configured to determine a target area having a predetermined size based on the positions of the first set of candidate frames such that the target area covers at least one candidate frame of the first set of candidate frames, wherein a ratio of a first number of candidate frames of the at least one candidate frame to a second number of candidate frames of the first set of candidate frames is greater than a predetermined threshold.
In some implementations, the image capture device is more than a predetermined threshold distance from the capturable object.
In some implementations, the first image and the second image are adjacent video frames captured by the image capture device.
The elements included in apparatus 400 may be implemented in a variety of ways, including software, hardware, firmware, or any combination thereof. In some embodiments, one or more units may be implemented using software and/or firmware, such as machine-executable instructions stored on a storage medium. In addition to or in lieu of machine-executable instructions, some or all of the elements in apparatus 400 may be at least partially implemented by one or more hardware logic components. By way of example and not limitation, exemplary types of hardware logic components that can be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
Fig. 5 illustrates a block diagram of a computing device/server 500 in which one or more embodiments of the disclosure may be implemented. It should be understood that the computing device/server 500 illustrated in fig. 5 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein.
As shown in fig. 5, computing device/server 500 is in the form of a general purpose computing device. Components of computing device/server 500 may include, but are not limited to, one or more processors or processing units 510, memory 520, storage 530, one or more communication units 540, one or more input devices 550, and one or more output devices 560. The processing unit 510 may be a real or virtual processor and is capable of performing various processes according to programs stored in the memory 520. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capabilities of computing device/server 500.
Computing device/server 500 typically includes a number of computer storage media. Such media may be any available media that is accessible by computing device/server 500, including but not limited to volatile and non-volatile media, and removable and non-removable media. The memory 520 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or some combination thereof. Storage device 530 may be removable or non-removable media and may include machine-readable media such as flash drives, magnetic disks, or any other media capable of storing information and/or data (e.g., training data) and accessible within computing device/server 500.
The computing device/server 500 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 5, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces. Memory 520 may include a computer program product 525 having one or more program modules configured to perform the various methods or acts of the various embodiments of the present disclosure.
Communication unit 540 enables communication with other computing devices via a communication medium. Additionally, the functionality of the components of computing device/server 500 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communication connection. Accordingly, computing device/server 500 may operate in a networked environment using logical connections to one or more other servers, a network Personal Computer (PC), or another network node.
The input device 550 may be one or more input devices such as a mouse, keyboard, trackball, etc. The output device 560 may be one or more output devices such as a display, speakers, printer, etc. The computing device/server 500 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., as needed through the communication unit 540, with one or more devices that enable a user to interact with the computing device/server 500, or with any device (e.g., network card, modem, etc.) that enables the computing device/server 500 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which one or more computer instructions are stored, wherein the one or more computer instructions are executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of implementations of the present disclosure has been provided for illustrative purposes, is not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various implementations described. The terminology used herein was chosen in order to best explain the principles of each implementation, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand each implementation disclosed herein.
Example implementation
TS 1. A method of object detection, comprising:
determining a first set of candidate boxes in a first image captured by an image capture device at a first time, the first set of candidate boxes indicating regions in the first image that may include an object;
determining a target region having a predetermined size based on the first set of candidate boxes;
Determining a sub-image corresponding to the target region from a second image captured by the image capturing device at a second time instant, the second time instant being later than the first time instant; and
detecting a first object from the sub-image.
TS 2. The method of TS 1, wherein detecting the first object comprises: determining a second set of candidate boxes in the sub-image associated with the first object,
Wherein the first set of candidate boxes is determined by a first recognition model and the second set of candidate boxes is determined by a second recognition model having a lower model complexity than the first recognition model.
TS 3. The method of TS 2, wherein determining a first set of candidate boxes in a first image captured by an image capture device comprises:
adjusting the first image to adapt the first recognition model; and
processing the adjusted first image using the first recognition model to determine the first set of candidate boxes.
TS 4. The method according to TS 2, wherein the first recognition model determines the first set of candidate boxes based on a first confidence threshold, and the second recognition model determines the second set of candidate boxes based on a second confidence threshold, the first confidence threshold being lower than the second confidence threshold.
TS 5. The method according to TS 2, further comprising:
Determining tracking information of the first object at the second time based on a match between the first set of candidate boxes and the second set of candidate boxes, the tracking information including at least one of: position, direction of movement, or rate of movement.
TS 6. The method according to TS 2, further comprising:
Processing a third image captured by the image capturing device at a third time with the first recognition model to detect a second object included in the third image, wherein the third time is later than the first time by a predetermined time interval, a sub-image of at least one image captured by the image capturing device between the first time and the second time, which corresponds to the target area, being provided to the second recognition model for object detection.
TS 7. The method according to TS 2, further comprising:
obtaining a third set of candidate boxes determined by the first recognition model based on a fourth image captured by the image capture device at a fourth time instant, the fourth time instant being a predetermined time interval earlier than the first time instant;
Determining motion information of a third object at the first time based on a match between the third set of candidate boxes and the first set of candidate boxes, the third object being associated with a predetermined candidate box of the first set of candidate boxes outside the target area; and
determining tracking information of the third object at the second moment based on the predetermined candidate frame in the first image and the motion information.
TS 8. The method according to TS 1, wherein determining the target area comprises:
Determining a target candidate frame in a first group of candidate frames based on the positions of the first group of candidate frames, wherein the number of adjacent candidate frames of the target candidate frame is higher than a preset threshold value, and the distance between the adjacent candidate frame and the target candidate frame is smaller than the preset threshold value; and
determining the target region having the predetermined size based on the position of the target candidate frame.
TS 9. The method according to TS 1, wherein determining the target area comprises:
determining the target region having the predetermined size based on the location of the first set of candidate boxes such that the target region covers at least one candidate box of the first set of candidate boxes, wherein a ratio of a first number of candidate boxes of the at least one candidate box to a second number of candidate boxes of the first set of candidate boxes is greater than a predetermined threshold.
TS 10. The method according to TS 1, wherein the distance of the image capturing device from the capturable object is greater than a predetermined threshold.
TS 11. The method of TS 1, wherein the first image and the second image are adjacent video frames captured by the image capture device.
TS 12. An apparatus for object detection, comprising:
A first candidate box determination module configured to determine a first set of candidate boxes in a first image captured by an image capture device at a first time, the first set of candidate boxes indicating regions in the first image that are likely to include an object;
a target region determination module configured to determine a target region based on the first set of candidate boxes;
a sub-image determining module configured to determine a sub-image corresponding to the target area from a second image captured by the image capturing device at a second time, the second time being later than the first time; and
An object detection module configured to detect a first object from the sub-image.
TS 13. An electronic device, comprising:
a memory and a processor;
Wherein the memory is for storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method according to any one of TS 1 to 11.
TS 14. A computer readable storage medium having stored thereon one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method according to any of TS 1 to 11.

Claims (13)

1. A method of object detection, comprising:
determining a first set of candidate boxes in a first image captured by an image capture device at a first time, the first set of candidate boxes indicating regions in the first image that may include an object;
determining a target region having a predetermined size based on the first set of candidate boxes, including:
Determining a target candidate frame in the first group of candidate frames based on the positions of the first group of candidate frames, wherein the number of adjacent candidate frames of the target candidate frame is higher than a preset threshold value, and the distance between the adjacent candidate frame and the target candidate frame is smaller than the preset threshold value;
determining the target region having the predetermined size based on the position of the target candidate frame;
Determining a sub-image corresponding to the target region from a second image captured by the image capturing device at a second time instant, the second time instant being later than the first time instant; and
detecting a first object from the sub-image.
2. The method of claim 1, wherein detecting the first object comprises: determining a second set of candidate boxes in the sub-image associated with the first object,
Wherein the first set of candidate boxes is determined by a first recognition model and the second set of candidate boxes is determined by a second recognition model having a lower model complexity than the first recognition model.
3. The method of claim 2, wherein determining a first set of candidate boxes in a first image captured by an image capture device comprises:
adjusting the first image to fit the first recognition model; and
processing the adjusted first image using the first recognition model to determine the first set of candidate boxes.
4. The method of claim 2, wherein the first recognition model determines the first set of candidate boxes based on a first confidence threshold, the second recognition model determines the second set of candidate boxes based on a second confidence threshold, the first confidence threshold being lower than the second confidence threshold.
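A sketch of the two-model arrangement of claims 2 to 4: the heavier first recognition model scans the resized full frame with a permissive confidence threshold, and the lighter second recognition model re-checks only the cropped sub-image with a stricter one. The `predict` interface, the `input_size` attribute, and both threshold values are placeholders, not anything fixed by the claims.

```python
import cv2  # OpenCV, assumed available for the resizing step of claim 3

FIRST_CONF_THRESH = 0.3   # permissive: the heavy model only proposes regions
SECOND_CONF_THRESH = 0.6  # strict: the light model confirms detections

def detect_first_pass(first_model, frame):
    """Full-frame pass with the heavier first recognition model."""
    # Adjust the frame to the model's expected input size (claim 3).
    resized = cv2.resize(frame, first_model.input_size)
    detections = first_model.predict(resized)  # assumed: [(box, score), ...]
    # A low threshold keeps weak candidates so the target region
    # is chosen from as many boxes as possible (claim 4).
    return [(box, score) for box, score in detections
            if score >= FIRST_CONF_THRESH]

def detect_second_pass(second_model, sub_image):
    """Sub-image pass with the cheaper second recognition model (claim 2)."""
    detections = second_model.predict(sub_image)
    # A higher threshold compensates for the lighter model's noisier scores.
    return [(box, score) for box, score in detections
            if score >= SECOND_CONF_THRESH]
```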
5. The method of claim 2, further comprising:
determining tracking information of the first object at the second time based on a match between the first set of candidate boxes and the second set of candidate boxes, the tracking information including at least one of: position, direction of movement, or rate of movement.
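Claim 5 leaves the matching criterion open; intersection-over-union association is one common choice and is assumed in the sketch below, which derives the position, movement direction, and movement rate of a matched object.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def center(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def track_object(box_t1, boxes_t2, dt, iou_thresh=0.3):
    """Match one first-image box against the second set of candidate boxes
    and report tracking information at the second time."""
    best = max(boxes_t2, key=lambda b: iou(box_t1, b), default=None)
    if best is None or iou(box_t1, best) < iou_thresh:
        return None  # object not re-found between the two frames
    (x1, y1), (x2, y2) = center(box_t1), center(best)
    vx, vy = (x2 - x1) / dt, (y2 - y1) / dt
    return {"position": (x2, y2),
            "direction": (vx, vy),               # movement direction vector
            "rate": (vx ** 2 + vy ** 2) ** 0.5}  # movement rate per unit time
```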
6. The method of claim 2, further comprising:
processing a third image captured by the image capture device at a third time with the first recognition model to detect a second object included in the third image, wherein the third time is later than the first time by a predetermined time interval, and wherein, for at least one image captured by the image capture device between the first time and the third time, a sub-image corresponding to the target region is provided to the second recognition model for object detection.
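The alternation in claim 6 reduces to a simple frame scheduler: a full-frame pass with the first recognition model every fixed interval, and sub-image passes with the second recognition model in between. The loop below is an illustrative reading that reuses the helpers from the earlier sketches; the interval length and region parameters are assumptions.

```python
import numpy as np

REFRESH_INTERVAL = 30  # assumed: full-frame re-detection every 30 frames

def run_pipeline(frames, first_model, second_model, region_size):
    """Yield (frame index, detections), alternating heavy and light passes."""
    region = None
    for i, frame in enumerate(frames):
        if i % REFRESH_INTERVAL == 0 or region is None:
            # Periodic heavy pass (claim 6), which also refreshes the
            # target region used by the cheap passes in between.
            detections = detect_first_pass(first_model, frame)
            boxes = np.array([list(box) for box, _ in detections])
            region = (select_target_region(boxes, region_size,
                                           dist_thresh=80.0, count_thresh=2)
                      if len(detections) else None)
            yield i, detections
        else:
            # Cheap pass: only the cropped target region is examined.
            sub = crop_sub_image(frame, region)
            yield i, detect_second_pass(second_model, sub)
```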
7. The method of claim 2, further comprising:
acquiring a third set of candidate boxes determined by the first recognition model based on a fourth image,
the fourth image being captured by the image capture device at a fourth time, the fourth time being earlier than the first time by a predetermined time interval;
determining motion information of a third object at the first time based on a match between the third set of candidate boxes and the first set of candidate boxes, the third object being associated with a predetermined candidate box of the first set of candidate boxes outside the target region; and
determining tracking information of the third object at the second time based on the predetermined candidate box in the first image and the motion information.
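Claim 7 keeps tracking an object that falls outside the target region without re-detecting it: motion estimated from the earlier fourth-image/first-image match is carried forward to the second time. A minimal sketch, assuming a constant-velocity motion model (the claim does not prescribe one):

```python
def extrapolate_track(box, velocity, dt):
    """Predict an un-redetected box's position at the second time.

    box: [x1, y1, x2, y2] of the predetermined candidate box in the first image.
    velocity: (vx, vy) motion information from the fourth-to-first-image match.
    dt: time elapsed between the first and second images.
    """
    vx, vy = velocity
    dx, dy = vx * dt, vy * dt
    return [box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy]
```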
8. The method of claim 1, wherein determining the target region comprises:
determining the target region having the predetermined size based on the positions of the first set of candidate boxes such that the target region covers at least one candidate box of the first set of candidate boxes,
wherein a ratio of a first number to a second number is greater than a predetermined threshold, the first number being the number of candidate boxes of the at least one candidate box within the target region, and the second number being the number of candidate boxes in the first set of candidate boxes.
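Claim 8 states a coverage criterion rather than claim 1's density criterion: the fixed-size region must contain more than a threshold fraction of all candidate boxes. A brute-force sketch that tries each candidate box center as the region center; the search strategy is an assumption of this sketch.

```python
import numpy as np

def region_by_coverage(boxes, region_size, ratio_thresh):
    """Place a fixed-size region covering at least ratio_thresh of the boxes.

    Tries every candidate box center as the region center and keeps the
    placement covering the largest fraction of candidate box centers.
    """
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    w, h = region_size
    best_region, best_ratio = None, 0.0
    for cx, cy in centers:
        x1, y1, x2, y2 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
        inside = ((centers[:, 0] >= x1) & (centers[:, 0] <= x2) &
                  (centers[:, 1] >= y1) & (centers[:, 1] <= y2))
        ratio = inside.mean()  # "first number" / "second number" in the claim
        if ratio > best_ratio:
            best_region, best_ratio = (x1, y1, x2, y2), ratio
    return best_region if best_ratio > ratio_thresh else None
```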
9. The method of claim 1, wherein a distance between the image capture device and the object being captured is greater than a predetermined threshold.
10. The method of claim 1, wherein the first image and the second image are adjacent video frames captured by the image capture device.
11. An apparatus for object detection, comprising:
a first candidate box determination module configured to determine a first set of candidate boxes in a first image captured by an image capture device at a first time, the first set of candidate boxes indicating regions in the first image that may include an object;
a target region determination module configured to determine a target region having a predetermined size based on the first set of candidate boxes, including:
determining a target candidate box in the first set of candidate boxes based on the positions of the first set of candidate boxes, wherein the number of adjacent candidate boxes of the target candidate box is higher than a predetermined number threshold, an adjacent candidate box being a candidate box whose distance from the target candidate box is smaller than a predetermined distance threshold; and
determining the target region having the predetermined size based on the position of the target candidate box;
a sub-image determination module configured to determine a sub-image corresponding to the target region from a second image captured by the image capture device at a second time, the second time being later than the first time; and
an object detection module configured to detect a first object from the sub-image.
12. An electronic device, comprising:
a memory and a processor;
wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method of any of claims 1 to 10.
13. A computer-readable storage medium having stored thereon one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method of any of claims 1 to 10.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281522.3A (en) 2021-03-16 2021-03-16 Method, device, equipment and storage medium for detecting object


Publications (2)

Publication Number Publication Date
CN113033551A CN113033551A (en) 2021-06-25
CN113033551B (en) 2024-09-24

Family

ID=76471027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281522.3A Active CN113033551B (en) 2021-03-16 2021-03-16 Method, device, equipment and storage medium for detecting object

Country Status (1)

Country Link
CN (1) CN113033551B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838100A (en) * 2021-11-24 2021-12-24 Zhongshan Power Supply Bureau of Guangdong Power Grid Co., Ltd. A method and system for dynamic target tracking based on edge computing


Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
GB2492247B (en) * 2008-03-03 2013-04-10 Videoiq Inc Dynamic object classification
CN109118456B (en) * 2018-09-26 2021-07-23 Beijing ByteDance Network Technology Co., Ltd. Image processing method and device
CN111798487B (en) * 2019-08-27 2024-07-16 Beijing Jingdong Shangke Information Technology Co., Ltd. Target tracking method, apparatus and computer readable storage medium
CN111598867B (en) * 2020-05-14 2021-03-09 Institute of Science and Technology, National Health Commission Method, apparatus, and computer-readable storage medium for detecting specific facial syndrome

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN110874547A (en) * 2018-08-30 2020-03-10 Fujitsu Ltd. Method and device for identifying object from video
CN109598743A (en) * 2018-11-20 2019-04-09 Beijing Jingdong Shangke Information Technology Co., Ltd. Pedestrian target tracking method, device and equipment

Non-Patent Citations (1)

Title
Target classification optimization based on separability features under continuous tracking; Li Zhihua et al.; Journal of Computer Applications; 2014-05-10; Vol. 34, No. 5; pp. 1275-1278 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant