CN110188597B - Crowd counting and localization method and system based on circular scaling of attention mechanism - Google Patents

Info

Publication number
CN110188597B
Authority
CN
China
Prior art keywords
crowd
branch
counting
map
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910293903.6A
Other languages
Chinese (zh)
Other versions
CN110188597A (en)
Inventor
陈刚
刘臣臣
王成成
黄波
韩峻
糜俊青
翁昕钰
穆亚东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongxing Micro Technology Co ltd
Peking University
Original Assignee
Zhongxing Technology Co ltd
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongxing Technology Co ltd and Peking University
Publication of CN110188597A
Application granted
Publication of CN110188597B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30242 Counting objects in image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and system for dense crowd counting and precise positioning based on attention mechanism cyclic scaling. Unlike existing crowd counting methods based on density maps, and unlike methods that estimate the crowd size through face or pedestrian detection, the invention uses a carefully designed three-branch deep neural network to obtain, for an input image, a crowd counting density map, a crowd position distribution map, and an attention map of dense candidate regions. The crowd counting density map gives an initial crowd count for the image; the crowd position distribution map gives the position coordinates of each person in the image; and the dense region candidate map gives several densely crowded regions, which are cut out of the original image, enlarged to twice their original resolution, and sent to the subsequent network to obtain more accurate person positioning results.

Description

Crowd counting and positioning method and system based on attention mechanism cyclic scaling
Technical Field
The invention relates to a method for counting dense crowds and accurately positioning the people in an image, and in particular to a method and system that achieve accurate crowd positioning using an attention mechanism and cyclic zooming; it belongs to the field of computer vision.
Background
With urbanization, city populations have risen sharply, and video surveillance cameras are now densely installed in many cities and used ever more widely in daily work and life. One of the most important applications of this video data is intelligent video surveillance. In China, with a population of 1.3 billion, problems caused by the sheer number of people are a constant threat to public safety. Elsewhere in the world, too, overly dense crowds at large events can lead to uncontrollable incidents. Making effective use of surveillance data to allocate public safety personnel sensibly, and to guide and divert crowds by building auxiliary traffic facilities, is therefore of great significance for maintaining public order and protecting personal safety. Conventional video surveillance, however, relies on manual monitoring, processing and reporting, which consumes a great deal of labor and resources. Automated video analysis not only frees up labor but can also mine massive amounts of video for useful knowledge and patterns. Crowd counting, as one area of video analysis, is important for crowd and pedestrian analysis, emergency monitoring, traffic planning and similar tasks.
Existing crowd counting techniques fall into two main categories: estimation by integrating a density map, and estimation by face or pedestrian detection. With the development of deep learning, many researchers use a deep neural network to learn a crowd density map and obtain the number of people in the picture by integrating it. This approach achieves good counting accuracy, but its main drawback is that, although the integral of the learned density map matches the number of people in the picture, the distribution of the learned density map differs considerably from the actual distribution, which hinders further crowd analysis.
Deep learning has also brought great progress in conventional object detection, so some researchers estimate the number of people by detecting the faces or pedestrians that appear in an image. This approach gives an accurate position for each person and avoids the inaccurate distributions predicted by density-map methods, but it has a serious problem of its own: existing face and pedestrian detectors perform poorly in extremely dense scenes, which is exactly where crowd estimation is usually needed. In such scenes faces and bodies are hard to make out, so detection-based methods struggle to achieve good results.
Disclosure of Invention
To address the inaccurate distributions predicted by density-map methods in dense crowd counting and the poor performance of detection-based methods in dense scenes, the invention provides a method and system for dense crowd counting and accurate positioning based on attention mechanism cyclic scaling. Using deep learning, the invention proposes an attention-based cyclic scaling network that splits crowd estimation in a dense picture into two problems: initial crowd estimation and accurate crowd positioning.
Unlike existing crowd counting methods based on density maps, and unlike methods that estimate the crowd size through face or pedestrian detection, the method uses a carefully designed three-branch deep neural network to obtain the crowd counting density map, the crowd position distribution map and the zoom candidate region attention map corresponding to the input image. The crowd counting density map gives an initial crowd count for the image; the crowd position distribution map gives the position coordinates of each person in the image; and the zoom candidate region attention map gives several densely crowded regions, which are cut out of the original image, enlarged to twice their original resolution, and sent to the subsequent circular scaling network to obtain a more accurate person positioning result. Because a crowd count can be obtained from both the crowd counting density map and the crowd position distribution map, the invention further provides a scene-adaptive weight that combines the two counts into a more accurate estimate of the crowd size.
The invention relates to a dense crowd counting and accurate positioning method based on attention mechanism cyclic scaling, which comprises the following steps of:
1) establishing a three-branch deep neural network, and respectively obtaining a crowd counting density graph, a crowd position distribution graph and a zoom candidate region attention graph corresponding to an input image;
2) obtaining an initial crowd counting value in the image through the crowd counting density map, obtaining the position coordinate of each person in the image through the crowd position distribution map, and obtaining a plurality of areas with dense crowd in the image through the zoom candidate area attention map;
3) cutting out a plurality of regions with dense crowd from the image, obtaining an accurate positioning result by improving the resolution, and updating the crowd position distribution map by using the accurate positioning result;
4) obtaining an accurate crowd count value by weighting the crowd count value obtained from the crowd counting density map and the crowd count value obtained from the crowd position distribution map.
The above process is further described below. The specific flow of the method is schematically shown in fig. 1, and comprises the following steps:
Step 1: network structure construction and parameter initialization. As shown in fig. 1, the method proposed by the invention includes two networks: a main network (MainNet) and a circular scaling network (Recurrent Attention Zooming Net, referred to as RAZNet). The MainNet includes a positioning branch (Localization Branch), a counting branch (Counting Branch), and a zoom candidate region branch (Zooming Region Proposal Branch).
The MainNet takes the first 13 layers of the VGG-16 network as its base network. The positioning branch consists of dilated convolution layers and 3 deconvolution layers, and finally outputs a one-layer feature map with the same resolution as the original picture; the counting branch consists only of dilated convolution layers and outputs a feature map at 1/8 of the resolution of the original picture; the feature map output by the positioning branch and the up-sampled feature map output by the counting branch are concatenated and used as the input of the zoom candidate region branch.
RAZNet has no counting branch; the rest of its structure is the same as MainNet. Parameters obtained by training VGG-16 on the ImageNet dataset are used to initialize the MainNet base network, and the parameters of the trained MainNet are used to initialize RAZNet.
Step 2: model training. To help the model converge, the three branches are trained in sequence: the counting branch, then the positioning branch, then the zoom candidate region branch. After the MainNet training is finished, its parameters are used to initialize RAZNet, which is then fine-tuned.
Step 3: selection of the fusion weight. After model training is completed, the crowd count values given by the positioning branch and by the counting branch are computed on the training set. Using the true crowd count of each image, a fusion weight between the two count values is learned so that the fused prediction is closer to the true value.
Step 4: network inference. After model training is completed, for each test picture the MainNet produces a crowd density map, a crowd position distribution map and a zoom candidate region attention map. Several dense regions are identified from the attention map, cut out of the original picture, and enlarged to twice their original length and width; these pictures are passed through RAZNet to obtain a new crowd position distribution map and a new zoom candidate region attention map. When the zoom candidate region attention map yields no new dense region, inference ends.
Step 5: obtaining the final crowd count value and head position coordinates. The positions of the peak points in the crowd position distribution map are taken as the final predicted head coordinates. To obtain the peak points, non-maximum suppression (NMS) is first applied to the crowd position distribution map, and all points whose response value exceeds a given threshold are then taken as head positions. Using the fusion weight obtained in Step 3, the fused result of the counting branch and the positioning branch is calculated and used as the final crowd count value.
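The control flow of the inference in Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `main_net` and `raz_net` are hypothetical callables that return a position output and a list of dense regions, a region is assumed to be a `(top, left, height, width)` tuple, enlargement is done by nearest-neighbor repetition, and `max_depth` is an added safeguard that the source does not mention (the source stops only when no new dense region is found).

```python
def crop_and_zoom(image, region, scale=2):
    """Cut region=(top, left, height, width) out of a 2-D list and enlarge it.

    Nearest-neighbor repetition is an assumption; the source only says the
    length and width are doubled.
    """
    top, left, h, w = region
    crop = [row[left:left + w] for row in image[top:top + h]]
    return [[v for v in row for _ in range(scale)]
            for row in crop for _ in range(scale)]


def recurrent_zoom(image, main_net, raz_net, scale=2, max_depth=3):
    """Run MainNet once, then RAZNet on every proposed dense region.

    main_net(image) and raz_net(patch) are hypothetical callables returning
    (position_output, dense_regions). Returns (depth, position_output) pairs.
    """
    pos, regions = main_net(image)
    results = [(0, pos)]
    pending = [(image, r, 1) for r in regions]
    while pending:  # inference ends when no dense region is left
        src, region, depth = pending.pop(0)
        patch = crop_and_zoom(src, region, scale)
        pos, new_regions = raz_net(patch)
        results.append((depth, pos))
        if depth < max_depth:  # safeguard, not part of the source
            pending.extend((patch, r, depth + 1) for r in new_regions)
    return results
```

Mapping the per-patch positions back to original image coordinates is a separate step and is left out here to keep the loop readable.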
As shown in fig. 1, the method includes two basic network modules, MainNet and RAZNet, where MainNet has three branches, and RAZNet has two branches, and the names and functions of the network modules and the branches are:
1. Main network (MainNet): performs initial crowd counting and rough crowd positioning on the input picture; the zoom candidate region attention map it produces guides the cropping and enlargement of the dense regions that follow.
2. Circular scaling network (RAZNet): performs crowd positioning on the dense regions selected by the MainNet to obtain a more accurate positioning result for each local region. The network also produces its own zoom candidate region attention map, from which it is decided whether to crop regions again and pass them through RAZNet once more.
3. Positioning branch (Localization Branch): takes features from the base network and, through 6 dilated convolution layers interleaved with 3 deconvolution layers, outputs a crowd position distribution map with the same resolution as the network input image.
4. Counting branch (Counting Branch): takes features from the base network and, through 6 dilated convolution layers, outputs a crowd density map whose height and width are each 1/8 of those of the network input image.
5. Zoom candidate region branch (Zooming Region Proposal Branch): takes features from the positioning branch and the counting branch, concatenates them as its input and, through 3 dilated convolution layers, outputs a zoom candidate region attention map with the same resolution as the network input image.
Correspondingly, the invention also provides a dense crowd counting and accurate positioning system based on the attention mechanism cycle scaling, which comprises:
a main network module, comprising a three-branch deep neural network, configured to obtain the crowd counting density map, the crowd position distribution map and the zoom candidate region attention map corresponding to the input image; to obtain an initial crowd count value for the image from the crowd counting density map, the position coordinates of each person in the image from the crowd position distribution map, and several densely crowded regions in the image from the zoom candidate region attention map; and to cut the densely crowded regions out of the image and increase their resolution;
a circular scaling network module, configured to take the densely crowded regions, after their resolution has been increased, as input, obtain an accurate person positioning result, and update the crowd position distribution map with that result;
and a fusion counting module, configured to obtain an accurate crowd count value by weighting the crowd count value obtained from the crowd counting density map and the crowd count value obtained from the crowd position distribution map.
Compared with existing crowd counting techniques, the dense crowd counting and accurate positioning method based on attention mechanism cyclic scaling has the following advantages:
1. The technique can accurately give the position of each person in the picture.
2. Densely crowded regions in the image are found automatically through an attention mechanism, and an accurate positioning result is obtained by increasing the resolution of those regions.
3. With the scene-adaptive weight, the results of crowd counting and crowd positioning are fused, which improves the accuracy of the crowd count.
Drawings
FIG. 1 is a schematic diagram of a network architecture;
FIG. 2 is a schematic diagram of generating dense crowd candidate regions from the attention map.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
1. Target data generation
When the model is trained, the image and the corresponding real crowd counting density graph, the real crowd position distribution graph (head position graph) and the real zoom candidate region attention graph are required to be used as training data.
(1) Real crowd counting density map generation: following previous work on crowd counting, the crowd density map is generated from the head coordinates given in the annotation data. A Gaussian convolution is introduced for each annotated head, and the density value at each pixel coordinate x of the real crowd counting density map (its symbol appears only as equation image BDA0002025838660000051) is calculated by the formula in equation image BDA0002025838660000053. Here N is the total number of heads in the image and the coordinates of the N heads are denoted $x_1, \dots, x_N$; the quantity shown in equation image BDA0002025838660000052 is the average distance from $x_i$ to its 4 nearest heads; $Z_i$ is the normalization parameter of the Gaussian convolution for each head; and $\beta$ is a scaling factor on the distance, set empirically to 0.1.
[Equation image BDA0002025838660000053 not reproduced: density value at x as a function of the N head coordinates, the nearest-head distance, $Z_i$ and $\beta$]
(2) Real head position map generation: each annotated head point and its four-neighborhood are set to 1 to obtain the final head position map.
(3) Real zoom candidate region attention map generation: for each pixel in the map, the three nearest head positions are found and the average of the three distances is calculated; a Gaussian transform of this value gives the response at that pixel, which reflects how densely crowded the different regions are.
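The three training targets described in (1) to (3) can be sketched in plain Python as below. Several details are assumptions made for illustration because the source formulas are only available as images: the density kernel is taken to be an isotropic Gaussian with standard deviation `beta` times the mean distance to the 4 nearest heads, normalized so that each head contributes exactly 1; the floor of 0.5 on that standard deviation, the attention-map width `sigma`, and the exact form of the "Gaussian transform" are hypothetical choices. Heads are assumed to be integer `(row, col)` coordinates inside the image.

```python
import math


def density_map(heads, height, width, beta=0.1, k=4):
    """Sum one normalized Gaussian per head (kernel form is an assumption)."""
    dmap = [[0.0] * width for _ in range(height)]
    for i, (r, c) in enumerate(heads):
        others = sorted(math.hypot(r - r2, c - c2)
                        for j, (r2, c2) in enumerate(heads) if j != i)
        nearest = others[:k]
        mean_d = sum(nearest) / len(nearest) if nearest else 1.0
        sigma = max(beta * mean_d, 0.5)  # floor is illustrative only
        weights = [[math.exp(-((y - r) ** 2 + (x - c) ** 2) / (2 * sigma ** 2))
                    for x in range(width)] for y in range(height)]
        z = sum(map(sum, weights))  # plays the role of Z_i
        for y in range(height):
            for x in range(width):
                dmap[y][x] += weights[y][x] / z
    return dmap


def head_position_map(heads, height, width):
    """Set each head pixel and its four-neighborhood to 1."""
    pmap = [[0] * width for _ in range(height)]
    for r, c in heads:
        for dr, dc in ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)):
            y, x = r + dr, c + dc
            if 0 <= y < height and 0 <= x < width:
                pmap[y][x] = 1
    return pmap


def zoom_attention_map(heads, height, width, sigma=4.0, k=3):
    """Gaussian response of the mean distance to the k nearest heads."""
    amap = [[0.0] * width for _ in range(height)]
    if not heads:
        return amap
    for y in range(height):
        for x in range(width):
            d = sorted(math.hypot(y - r, x - c) for r, c in heads)[:k]
            mean_d = sum(d) / len(d)
            amap[y][x] = math.exp(-mean_d ** 2 / (2 * sigma ** 2))
    return amap
```

With this normalization the density map sums to the number of annotated heads, which is what lets the counting branch recover a count by integration, and the attention response is highest where heads are packed closely together.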
2. Network structure construction.
The invention counts and accurately positions crowds using deep learning; the structure of the deep neural network is shown in fig. 1. The network comprises two networks, MainNet and RAZNet. MainNet takes the first 13 layers of VGG-16 as its base network, followed by three parts: the positioning branch, the counting branch and the zoom candidate region branch. RAZNet consists of two parts: the positioning branch and the zoom candidate region branch. The detailed configuration parameters of MainNet and RAZNet are given in Table 1 below.
TABLE 1 Configuration parameters for MainNet and RAZNet
[Table images BDA0002025838660000054 and BDA0002025838660000061 not reproduced]
3. Training procedure for MainNet and RAZNet.
MainNet is trained first. As described in section 2, the MainNet consists of three parts: the counting branch, the positioning branch and the zoom candidate region branch. To help the model converge, these are trained in sequence: the counting branch, then the positioning branch, then the zoom candidate region branch.
(1) For the counting branch, the MSE loss between the density map output by the branch and the real density map is the optimization objective. In the formula (equation image BDA0002025838660000063), $\varepsilon_{den}(I)$ is the loss on picture I, m and n are the height and width of the input picture, and $\phi(p)$ and the symbol shown in equation image BDA0002025838660000062 are, respectively, the predicted and true values at the p-th pixel of the output crowd counting density map.
[Equation image BDA0002025838660000063 not reproduced: MSE loss $\varepsilon_{den}(I)$]
(2) After the counting branch has converged, its learned parameters are used to initialize the positioning branch. Unlike the counting branch, the positioning branch takes the weighted binary cross-entropy (BCE) loss between the predicted and real head position maps as its optimization objective. $\varepsilon_{loc}(I)$ is the BCE loss on picture I, where m and n are the height and width of the input picture, $Y(x_p)$ is the true value at the p-th pixel, $\psi(p)$ is the predicted value at the p-th pixel, and $\gamma$ is a weight, set empirically to 100.
[Equation image BDA0002025838660000064 not reproduced: $\varepsilon_{loc}(I)$ defined from the per-pixel loss $l(x_p)$]
$l(x_p) = -\gamma \cdot Y(x_p) \cdot \log(\psi(p)) - (1 - Y(x_p)) \cdot \log(1 - \psi(p))$
(3) After the counting branch and the positioning branch have been trained, their parameters are frozen and training of the zoom candidate region branch begins, with the MSE loss of that branch as the optimization objective.
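The two loss functions used in (1) to (3) can be illustrated with the short sketch below. The per-pixel weighted BCE term follows the expression for $l(x_p)$ given in the text; averaging over the m × n pixels in both losses, and the clamping constant `eps`, are assumptions, since the full formulas appear only as images in the source.

```python
import math


def mse_loss(pred, target):
    """Mean squared error over all pixels (the 1/(m*n) factor is assumed)."""
    m, n = len(pred), len(pred[0])
    total = sum((pred[i][j] - target[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)


def weighted_bce(pred, target, gamma=100.0, eps=1e-7):
    """Weighted binary cross-entropy; gamma up-weights head (positive) pixels."""
    m, n = len(pred), len(pred[0])
    total = 0.0
    for i in range(m):
        for j in range(n):
            p = min(max(pred[i][j], eps), 1.0 - eps)  # avoid log(0)
            y = target[i][j]
            total += -gamma * y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / (m * n)
```

Because head pixels are rare in the position map, the weight γ = 100 makes a missed head far more costly than a false response of the same confidence on a background pixel.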
After the MainNet training is completed, RAZNet is trained; it retains only the positioning branch and the zoom candidate region branch. Unlike for MainNet, several densely crowded regions of the original image are selected from the attention map, as shown in fig. 2, and used as training samples for RAZNet. Because the network structures of RAZNet and MainNet are essentially the same, the parameters learned in MainNet are used to initialize RAZNet, and the positioning branch and the zoom candidate region branch are fine-tuned in turn.
4. Obtaining the total number of people in the image from the counting branch.
The crowd density map output by the counting branch is integrated to give the total number of people in the image as predicted by that branch.
5. Obtaining head coordinates and the total number of people in the image from the positioning branch.
The positioning branch outputs a head position distribution map; the local peak points in this map are extracted, and the final head coordinates are obtained after a non-maximum suppression (NMS) operation:
1) first, average pooling with a 3x3 kernel is applied to the head position distribution map to highlight likely peak points in each local area;
2) max pooling with a 3x3 kernel is applied to the result of the first step, and the max-pooled map is compared pixel by pixel with the map before max pooling; the positions where the two maps have the same value are the required local peak points;
3) the peak points from the second step whose response value is greater than a given threshold are taken as the final head coordinates;
4) counting the head coordinates obtained gives the total number of people in the image.
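The four steps above can be sketched as follows. The pooling uses stride 1 and averages only over the cells that fall inside the image at the borders; that border handling and the default `threshold` value are assumptions, as the source does not specify them.

```python
def pool3x3(grid, reduce_fn):
    """Apply reduce_fn over each 3x3 neighborhood (stride 1, clipped at borders)."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [grid[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = reduce_fn(window)
    return out


def head_peaks(pos_map, threshold=0.3):
    """Return (row, col) head coordinates found by pooling-based NMS."""
    smooth = pool3x3(pos_map, lambda v: sum(v) / len(v))  # step 1: average pooling
    local_max = pool3x3(smooth, max)                      # step 2: max pooling
    return [(y, x)
            for y in range(len(smooth)) for x in range(len(smooth[0]))
            if smooth[y][x] == local_max[y][x]            # unchanged => local peak
            and smooth[y][x] > threshold]                 # step 3: threshold


def count_heads(pos_map, threshold=0.3):
    """Step 4: the crowd count is the number of detected head coordinates."""
    return len(head_peaks(pos_map, threshold))
```

Note that a flat plateau of equal responses would yield several adjacent peaks under this equality test; a real implementation may need an extra tie-breaking rule.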
6. Learning scene-dependent fusion weights for the positioning branch and the counting branch.
After model training is finished, the crowd count values of the counting branch and the positioning branch are computed on the training set as described in sections 4 and 5. Using the true crowd count of each image, a fusion weight between the two count values (denoted by α in fig. 1) is learned so that the fused prediction is closer to the true value. For example, when the counts from the counting branch and the positioning branch differ by more than 150, the counting branch is the more accurate of the two, and its result is the one trusted.
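One way to realize such a fusion is sketched below. The source does not say how α is learned, so the least-squares fit of a convex combination α·c_den + (1 − α)·c_loc is an assumption chosen for illustration; only the example rule of trusting the counting branch when the two counts differ by more than 150 comes from the text. The function and parameter names are hypothetical.

```python
def fit_fusion_weight(den_counts, loc_counts, true_counts):
    """Least-squares alpha for alpha*den + (1 - alpha)*loc, clipped to [0, 1]."""
    num = sum((d - l) * (t - l)
              for d, l, t in zip(den_counts, loc_counts, true_counts))
    den = sum((d - l) ** 2 for d, l in zip(den_counts, loc_counts))
    if den == 0:  # the two branches always agree; any weight works
        return 0.5
    return min(1.0, max(0.0, num / den))


def fuse_counts(den_count, loc_count, alpha, gap=150):
    """Fuse the two branch counts; trust the counting branch on large gaps."""
    if abs(den_count - loc_count) > gap:
        return float(den_count)
    return alpha * den_count + (1 - alpha) * loc_count
```

For instance, with training counts where the true value lies midway between the two branches, the fitted weight is 0.5 and `fuse_counts(100, 80, 0.5)` returns 90.0.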
7. Fusion of results from the positioning branches of MainNet and RAZNet.
By design, the result from RAZNet is an accurate positioning result for one region of the original image and should in theory be more accurate than the result from the positioning branch of MainNet. Replacing MainNet's detections in a region with RAZNet's detections for that region therefore completes the accurate head positioning for that part of the image.
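The replacement described here amounts to a coordinate mapping plus a filter, as in the sketch below. It assumes a region is given as `(top, left, height, width)` in original image coordinates and that RAZNet's detections are expressed in the coordinates of the enlarged crop, so they are divided by the zoom factor before the region offset is added; these conventions are illustrative and not taken from the source.

```python
def merge_detections(main_points, region, zoom_points, scale=2):
    """Replace MainNet detections inside region with RAZNet detections.

    main_points: (row, col) detections in original image coordinates.
    zoom_points: (row, col) detections in the enlarged crop's coordinates.
    """
    top, left, h, w = region
    kept = [(y, x) for y, x in main_points
            if not (top <= y < top + h and left <= x < left + w)]
    mapped = [(top + y / scale, left + x / scale) for y, x in zoom_points]
    return kept + mapped
```

Doubling the resolution means RAZNet can separate heads that fall on the same or adjacent pixels in the original image, which is why the mapped coordinates may be fractional.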
8. Obtaining scene-adaptive fusion weights and fusing the density-map-based and detection-based results to improve the accuracy of head counting.
Using the fusion weight between the positioning branch and the counting branch learned in section 6, the results of the two branches are fused on the test set to obtain the final crowd count value.
The performance of the invention on three datasets commonly used for crowd counting, ShanghaiTech_A, ShanghaiTech_B and UCF_QNRF, is shown in Table 2. On the evaluation metrics mean absolute error (MAE) and mean squared error (MSE), it outperforms the previous methods. A "-" in the table indicates that a method does not report performance on that dataset.
TABLE 2 Comparison of the invention with other methods
[Table image BDA0002025838660000081 not reproduced]
The methods compared with the invention are MCNN (Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, 2016), Switch-CNN (D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. In CVPR, 2017), CP-CNN (V. A. Sindagi and V. M. Patel. Generating high-quality crowd density maps using contextual pyramid CNNs. In ICCV, 2017), and CSRNet [the rest of the citation list is garbled in the source text].
Another embodiment of the present invention provides a dense population counting and pinpointing system based on attention mechanism cycle scaling, comprising:
a main network module, comprising a three-branch deep neural network, configured to obtain the crowd counting density map, the crowd position distribution map and the zoom candidate region attention map corresponding to the input image; to obtain an initial crowd count value for the image from the crowd counting density map, the position coordinates of each person in the image from the crowd position distribution map, and several densely crowded regions in the image from the zoom candidate region attention map; and to cut the densely crowded regions out of the image and increase their resolution;
a circular scaling network module, configured to take the densely crowded regions, after their resolution has been increased, as input, obtain an accurate person positioning result, and update the crowd position distribution map with that result;
and a fusion counting module, configured to obtain an accurate crowd count value by weighting the crowd count value obtained from the crowd counting density map and the crowd count value obtained from the crowd position distribution map.
In the invention, the basic network of the MainNet can be replaced by a stronger VGG19 or Resnet series model from VGG16, and the stronger basic network model can bring better effect.
In the invention, during RAZNet training, the resolution can be amplified to be twice or higher than the original resolution within the range allowed by video memory.
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person skilled in the art can modify the technical solution of the present invention or substitute equivalents without departing from the principle and scope of the present invention, and the scope of protection of the present invention should be determined by the claims.

Claims (10)

1. A dense crowd counting and accurate positioning method based on attention mechanism cyclic scaling, characterized by comprising the following steps:
1) establishing a three-branch deep neural network, and obtaining a crowd counting density map, a crowd position distribution map and a zoom candidate region attention map corresponding to an input image;
2) obtaining an initial crowd count value for the image from the crowd counting density map, obtaining the position coordinates of each person in the image from the crowd position distribution map, and obtaining a plurality of densely crowded regions in the image from the zoom candidate region attention map;
3) cropping the densely crowded regions out of the image, obtaining an accurate positioning result by increasing their resolution, and updating the crowd position distribution map with the accurate positioning result;
4) obtaining an accurate crowd count value by weighting the crowd count value obtained from the crowd counting density map and the crowd count value obtained from the crowd position distribution map updated in step 3).
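Step 3) implies that head positions found in an enlarged crop have to be expressed in the coordinates of the original image before the position distribution map can be updated. The sketch below shows one way this could be done. The replace-inside-the-box update rule, the half-open box convention and the function names are assumptions for illustration, not details stated in the claims.

```python
def to_original_coords(points, crop_left, crop_top, factor=2.0):
    """Map (x, y) points from an enlarged crop back to original image coordinates."""
    return [(crop_left + x / factor, crop_top + y / factor) for x, y in points]


def update_positions(positions, refined, crop_box):
    """Replace coarse positions inside crop_box with the refined ones.

    crop_box is (left, top, right, bottom) in original image coordinates,
    with right and bottom exclusive (an assumed convention).
    """
    left, top, right, bottom = crop_box
    kept = [(x, y) for x, y in positions
            if not (left <= x < right and top <= y < bottom)]
    return kept + list(refined)


# A head found at (10, 20) in a crop that starts at (100, 50) and was enlarged 2x.
refined = to_original_coords([(10, 20), (16, 24)], crop_left=100, crop_top=50)
coarse = [(5, 5), (110, 60)]
updated = update_positions(coarse, refined, crop_box=(100, 50, 140, 90))
```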
2. The method of claim 1, wherein the three-branch deep neural network constitutes a main network, the main network comprising a positioning branch, a counting branch and a zoom candidate region branch; the positioning branch consists of dilated convolution layers and 3 deconvolution layers, and outputs a crowd position distribution map with the same resolution as the original image; the counting branch consists only of dilated convolution layers, and outputs a crowd counting density map at 1/8 of the original image resolution; the feature maps output by the positioning branch and the counting branch are concatenated to serve as the input of the zoom candidate region branch, and the zoom candidate region branch passes this input through 3 dilated convolution layers and outputs a zoom candidate region attention map with the same resolution as the input image.
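The resolutions named in claim 2 can be checked with simple arithmetic. The sketch below assumes that each of the 3 deconvolution layers doubles the spatial size (so that three of them undo the 1/8 downsampling) and that the input height and width are divisible by 8; neither assumption is stated in the claim, and the function name is hypothetical.

```python
def branch_output_shapes(height, width, downsample=8, num_deconv=3, deconv_stride=2):
    """Return the (height, width) of each branch output for a given input size."""
    feat_h, feat_w = height // downsample, width // downsample
    upsample = deconv_stride ** num_deconv           # 2 ** 3 = 8 under the assumption
    positioning = (feat_h * upsample, feat_w * upsample)
    return {
        "counting_density_map": (feat_h, feat_w),    # 1/8 of the input resolution
        "position_distribution_map": positioning,    # same as the input resolution
        "zoom_attention_map": positioning,           # same as the input resolution
    }


shapes = branch_output_shapes(512, 768)
```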
3. A method according to claim 1 or 2, wherein the increasing of the resolution is an enlargement of the resolution by a factor of two.
4. The method of claim 2, wherein the accurate positioning result is obtained by feeding the densely crowded regions, after their resolution has been increased, into a cyclic scaling network; the cyclic scaling network contains no counting branch, and the rest of its structure is consistent with the main network.
5. The method of claim 4, wherein the cyclic scaling network itself also produces a zoom candidate region attention map, according to which it is decided whether to crop regions again and pass them through the cyclic scaling network once more, until no new densely crowded region can be found in the zoom candidate region attention map.
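The stopping rule of claim 5 can be sketched as a recursion over regions. The three callables below (`locate`, `propose`, `zoom`) are hypothetical stand-ins for the positioning output, the attention-map region proposal and the crop-and-enlarge step. The `max_depth` guard is an added assumption, since the claim only stops when no new dense region is found, and results are left in each region's own coordinate frame.

```python
def recurrent_zoom(region, locate, propose, zoom, max_depth=4):
    """Collect positioning results from a region and, recursively, its dense sub-regions.

    locate(region)    -> list of results found at this scale
    propose(region)   -> list of dense sub-region boxes; an empty list stops the recursion
    zoom(region, box) -> the cropped, enlarged sub-region
    """
    results = list(locate(region))
    if max_depth == 0:
        return results
    for box in propose(region):
        sub_region = zoom(region, box)
        results.extend(recurrent_zoom(sub_region, locate, propose, zoom, max_depth - 1))
    return results


# Toy stand-ins: a "region" is just a density number that halves with each zoom.
toy_locate = lambda density: [density]
toy_propose = lambda density: [density // 2] if density > 10 else []
toy_zoom = lambda density, box: box

trace = recurrent_zoom(80, toy_locate, toy_propose, toy_zoom)
```

In the toy run the recursion visits densities 80, 40, 20 and 10 and then stops because no further dense region is proposed.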
6. The method of claim 4, wherein the three branches of the main network are trained sequentially in the order: counting branch, positioning branch, zoom candidate region branch; and the trained parameters of the main network are used as initialization parameters of the cyclic scaling network, which is then fine-tuned.
7. The method of claim 6, wherein, for the counting branch, the MSE loss between the crowd density map output by the branch and the ground-truth crowd density map is used as the optimization objective function to perform gradient updates on the model parameters of the branch; after the counting branch converges, its learned parameters are used as initialization parameters of the positioning branch, and the weighted BCE loss between the head position map predicted by the positioning branch and the ground-truth head position map is used as the optimization objective function to perform gradient updates on the model parameters of that branch; after the counting branch and the positioning branch have been trained, the parameters of both branches are fixed and training of the zoom candidate region branch begins, with MSE loss as the optimization objective function.
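The two objective functions named in claim 7 can be written out on flattened maps as follows. This is a minimal sketch: the claim does not say how the BCE weight is applied, so a single weight on the positive (head) pixels, `pos_weight`, is an assumption, as are the function names and the clamping constant `eps`.

```python
import math


def mse_loss(pred, target):
    """Mean squared error between two flattened maps of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def weighted_bce_loss(pred, target, pos_weight=1.0, eps=1e-7):
    """Binary cross-entropy with an extra weight on positive pixels (assumed form).

    pred holds probabilities in (0, 1); target holds 0/1 labels.
    """
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite
        total -= pos_weight * t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred)


density_loss = mse_loss([1.0, 2.0], [1.0, 4.0])
head_loss = weighted_bce_loss([0.5, 0.5], [1, 0], pos_weight=3.0)
```

Weighting the positive pixels more heavily is a common way to compensate for head pixels being far rarer than background pixels in a position map.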
8. The method of claim 1, wherein the accurate crowd count value is obtained by weighting through the following steps:
a) obtaining, on the training set, crowd count values from the crowd counting density map and from the crowd position distribution map respectively;
b) learning the fusion weight between the two crowd count values obtained in step a) according to the ground-truth crowd count value corresponding to each existing image.
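Steps a) and b) can be sketched as a one-parameter least-squares fit. The form of the fusion is not given in the claim. The sketch assumes a single scalar weight `w` with fused count `w * density_count + (1 - w) * position_count`, fitted in closed form; the function names are hypothetical.

```python
def learn_fusion_weight(density_counts, position_counts, true_counts):
    """Least-squares weight w for fused = w * density + (1 - w) * position."""
    numerator = 0.0
    denominator = 0.0
    for a, b, y in zip(density_counts, position_counts, true_counts):
        numerator += (a - b) * (y - b)
        denominator += (a - b) ** 2
    if denominator == 0.0:
        return 0.5   # both estimates always agree, so any weight gives the same result
    return numerator / denominator


def fuse_counts(density_count, position_count, weight):
    """Weighted combination of the two crowd count estimates."""
    return weight * density_count + (1.0 - weight) * position_count


w = learn_fusion_weight([100, 200], [80, 160], [90, 180])
fused = fuse_counts(100, 80, w)
```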
9. The method of claim 1, wherein the crowd count value is obtained from the crowd position distribution map through the following steps:
a) performing non-maximum suppression on the crowd position distribution map, and then taking all positions whose response value is greater than a given threshold as peak points;
b) taking the positions of the peak points in the crowd position distribution map as head positioning coordinates;
c) counting the head positioning coordinates to obtain the total number of people appearing in the image.
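Steps a) to c) can be illustrated on a small response map. This sketch assumes that non-maximum suppression is done over a square neighbourhood of radius 1 and that the threshold is 0.5; neither value is given in the claim, and the function name is hypothetical.

```python
def find_peaks(score_map, threshold=0.5, radius=1):
    """Return (x, y) positions that exceed the threshold and are strict local maxima."""
    height, width = len(score_map), len(score_map[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = score_map[y][x]
            if value <= threshold:
                continue
            neighbours = [
                score_map[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
                if (j, i) != (y, x)
            ]
            # Strict comparison: exact ties between neighbours are suppressed.
            if all(value > n for n in neighbours):
                peaks.append((x, y))
    return peaks


position_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.4],
]
heads = find_peaks(position_map)   # head positioning coordinates
head_count = len(heads)            # total number of people
```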
10. A dense crowd counting and accurate positioning system based on attention mechanism cyclic scaling, comprising:
a main network module, which comprises a three-branch deep neural network and is used to obtain a crowd counting density map, a crowd position distribution map and a zoom candidate region attention map corresponding to the input image; to obtain an initial crowd count value for the image from the crowd counting density map, the position coordinates of each person in the image from the crowd position distribution map, and a plurality of densely crowded regions in the image from the zoom candidate region attention map; and to crop the densely crowded regions out of the image and increase their resolution;
a cyclic scaling network module, which takes the densely crowded regions with increased resolution as input, obtains an accurate person positioning result, and updates the crowd position distribution map with that result;
and a fusion counting module, which obtains an accurate crowd count value by weighting the crowd count value obtained from the crowd counting density map and the crowd count value obtained from the crowd position distribution map.
CN201910293903.6A 2019-01-04 2019-04-12 Crowd counting and localization method and system based on circular scaling of attention mechanism Active CN110188597B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019100070505 2019-01-04
CN201910007050 2019-01-04

Publications (2)

Publication Number Publication Date
CN110188597A CN110188597A (en) 2019-08-30
CN110188597B CN110188597B (en) 2021-06-15

Family

ID=67714173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910293903.6A Active CN110188597B (en) 2019-01-04 2019-04-12 Crowd counting and localization method and system based on circular scaling of attention mechanism

Country Status (1)

Country Link
CN (1) CN110188597B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7383435B2 (en) * 2019-09-17 2023-11-20 キヤノン株式会社 Image processing device, image processing method, and program
CN110852267B (en) * 2019-11-11 2022-06-14 复旦大学 Crowd density estimation method and device based on optical flow fusion type deep neural network
CN111445442B (en) * 2020-03-05 2024-04-30 中国平安人寿保险股份有限公司 Crowd counting method and device based on neural network, server and storage medium
CN111428653B (en) * 2020-03-27 2024-02-02 湘潭大学 Pedestrian congestion state judging method, device, server and storage medium
CN111626184B (en) * 2020-05-25 2022-04-15 齐鲁工业大学 Crowd density estimation method and system
CN111680648B (en) * 2020-06-12 2023-04-18 成都数之联科技股份有限公司 Training method of target density estimation neural network
CN111950458A (en) * 2020-08-12 2020-11-17 每步科技(上海)有限公司 Natatorium monitoring system and method and intelligent robot
CN112084959B (en) * 2020-09-11 2024-04-16 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112183627B (en) * 2020-09-28 2024-07-19 中星技术股份有限公司 Method for generating prediction density map network and vehicle annual inspection number detection method
CN113205280B (en) * 2021-05-28 2023-06-23 广西大学 A site selection method for electric vehicle charging stations based on Lie group guided attention reasoning network
CN113284130A (en) * 2021-06-15 2021-08-20 广东蓝鲲海洋科技有限公司 Attention zooming mechanism for crowd counting
CN114120361B (en) * 2021-11-19 2023-06-02 西南交通大学 A Crowd Counting and Positioning Method Based on Codec Structure
CN114241411B (en) * 2021-12-15 2024-04-09 平安科技(深圳)有限公司 Counting model processing method and device based on target detection and computer equipment
CN114494999B (en) * 2022-01-18 2022-11-15 西南交通大学 Double-branch combined target intensive prediction method and system
CN116884033A (en) * 2023-07-05 2023-10-13 中国移动通信集团江苏有限公司 Crowd counting method, device, terminal equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3910626B2 (en) * 2003-10-21 2007-04-25 松下電器産業株式会社 Monitoring device
WO2005039181A1 (en) * 2003-10-21 2005-04-28 Matsushita Electric Industrial Co., Ltd. Monitoring device
CN102013022B (en) * 2010-11-23 2012-10-10 北京大学 Selective feature background subtraction method aiming at thick crowd monitoring scene
US10009579B2 (en) * 2012-11-21 2018-06-26 Pelco, Inc. Method and system for counting people using depth sensor
CN108764085B (en) * 2018-05-17 2022-02-25 上海交通大学 A Crowd Counting Method Based on Generative Adversarial Networks
CN108805619A (en) * 2018-06-07 2018-11-13 肇庆高新区徒瓦科技有限公司 A kind of stream of people's statistical system for billboard
CN109101930B (en) * 2018-08-18 2020-08-18 华中科技大学 A crowd counting method and system

Also Published As

Publication number Publication date
CN110188597A (en) 2019-08-30

Similar Documents

Publication Publication Date Title
CN110188597B (en) Crowd counting and localization method and system based on circular scaling of attention mechanism
CN108921051B (en) Pedestrian Attribute Recognition Network and Technology Based on Recurrent Neural Network Attention Model
CN113240688A (en) Integrated flood disaster accurate monitoring and early warning method
CN111723693B (en) Crowd counting method based on small sample learning
Xu et al. High-resolution remote sensing image change detection combined with pixel-level and object-level
CN101470809B (en) Moving object detection method based on expansion mixed gauss model
CN108960047B (en) Face duplication removing method in video monitoring based on depth secondary tree
CN101986348A (en) Visual target identification and tracking method
CN115205559A (en) A method for cross-domain vehicle re-identification and continuous trajectory construction
CN109886356A (en) A target tracking method based on three-branch neural network
CN116258608B (en) Water conservancy real-time monitoring information management system integrating GIS and BIM three-dimensional technology
CN112613668A (en) Scenic spot dangerous area management and control method based on artificial intelligence
CN114627397A (en) Behavior recognition model construction method and behavior recognition method
CN119992393B (en) Unmanned aerial vehicle visual angle small target detection method based on self-attention mechanism
CN106815563A (en) A kind of crowd's quantitative forecasting technique based on human body apparent structure
CN109977968A (en) A kind of SAR change detecting method of deep learning classification and predicting
CN115601841A (en) A Human Abnormal Behavior Detection Method Combining Appearance Texture and Motion Skeleton
CN119649399A (en) Cross-modal person re-identification method based on multi-scale cross-attention Transformer
CN120032119B (en) Kelp seedling foreign matter AI detection method
Guo et al. Discriminative prototype learning for few-shot object detection in remote-sensing images
CN118795575B (en) A short-term precipitation prediction method based on deep attention voxel flow in complex frequency domain
CN118568408B (en) Crowd space-time distribution analysis method, device, equipment and medium based on monitoring array
CN117058882B (en) A traffic data compensation method based on multi-feature dual discriminator
CN116805337B (en) A crowd positioning method based on cross-scale visual transformation network
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: No. 5 Yiheyuan Road, Haidian District, Beijing 100871

Patentee after: Peking University

Country or region after: China

Patentee after: Zhongxing Micro Technology Co.,Ltd.

Address before: No. 5 Yiheyuan Road, Haidian District, Beijing 100871

Patentee before: Peking University

Country or region before: China

Patentee before: Zhongxing Technology Co.,Ltd.