CN109218660B - Video processing method and device - Google Patents
Video processing method and device
- Publication number
- CN109218660B CN201710551156.2A CN201710551156A
- Authority
- CN
- China
- Prior art keywords
- information
- video
- frame
- target
- time
- Prior art date
- Legal status
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/454—Content or additional data filtering, e.g. blocking advertisements
- H04N21/4545—Input to filtering algorithms, e.g. filtering a region of the image
- H04N21/45457—Input to filtering algorithms, e.g. filtering a region of the image applied to a time segment
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/80—Camera processing pipelines; Components thereof
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
The embodiments of the invention provide a video processing method and a video processing device. Target recognition is performed on the video frames of a base video, and set information containing the structured semantic information of each target is generated for each frame. Each video frame is then divided into a corresponding time group according to its time information, and the structured semantic information of the frames in each time group is aggregated to obtain intra-group structured semantic information. In this way the image information recorded in the base video frames is converted into structured semantic information that carries time information, and, exploiting the fact that target activity is continuous, the structured semantic information of each target is aggregated over a time period to obtain the structured semantic information of that period. When a user needs to know what the base video recorded, the user can essentially learn it from the intra-group structured semantic information corresponding to each time period, which greatly reduces the amount of video the user has to browse, lightens the user's burden and improves the user experience.
Description
Technical Field
The present invention relates to the field of computers, and in particular, to a video processing method and apparatus.
Background
In the security field, video monitoring is a very important means and the most widely used solution; in public transportation, commercial premises and home burglary prevention, for example, surveillance cameras can be seen everywhere. Because a surveillance camera objectively records everything in the monitored area at all times, it can provide credible evidence when people need to trace back and check what happened. However, precisely because the cameras record everything objectively, the videos they capture contain a great deal of useless information, which is especially prominent in civil security: for example, if a user wants to know what happened at home on Monday from the video recorded by a camera installed there, the user has to browse all of the video the camera recorded that day, which takes a lot of time. In practice the user will not spend that much time perusing it, but instead selects portions of the complete video to view based on subjective guesses. Such browsing is far too subjective, and the user can easily miss important video information.
Over the whole week, what the user actually wants to know is whether a person or animal was active in the monitored area; completely still pictures are of no interest. For this reason, existing video processing schemes use motion detection to retain the video frames that contain motion information for the user to browse. Although such a scheme screens out some useless video frames, the remaining video is still long, the user cannot grasp the key points, and a great deal of meaningless motion, such as illumination changes or curtains blown by the wind, is easily recorded.
Therefore, a new video processing scheme is needed that extracts the information a user cares about from a large amount of video, so as to alleviate the problems that the user must spend a great deal of time to learn the video content and that the user experience is poor.
Disclosure of Invention
The embodiments of the invention provide a video processing method and a video processing device, which mainly solve the following technical problem: in existing video processing schemes, the video frames containing motion information are retained by motion detection for the user to browse, so the user still has to browse a large amount of video and the user experience is poor.
To solve the foregoing technical problem, an embodiment of the present invention provides a video processing method, including:
performing target recognition on the video frames of a base video, and generating set information for each video frame according to the recognition result, wherein the set information comprises the structured semantic information of each target in the video frame, and the structured semantic information comprises time information, target information and target behavior information arranged in a preset order;
dividing each video frame into corresponding time groups according to the time information corresponding to each video frame;
and aggregating the structured semantic information corresponding to the video frames in a time group to obtain intra-group structured semantic information for that time group.
The embodiment of the invention also provides a video processing device, which comprises a processor, a memory and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is used for executing the video processing program stored in the memory so as to implement the following steps:
performing target recognition on the video frames of a base video, and generating set information for each video frame according to the recognition result, wherein the set information comprises the structured semantic information of each target in the video frame, and the structured semantic information comprises time information, target information and target behavior information arranged in a preset order;
dividing each video frame into corresponding time groups according to the time information corresponding to each video frame;
and aggregating the structured semantic information corresponding to the video frames in a time group to obtain intra-group structured semantic information for that time group.
An embodiment of the present invention further provides a computer storage medium, where a computer-executable instruction is stored in the computer storage medium, and the computer-executable instruction is used to execute any one of the foregoing video processing methods.
The invention has the beneficial effects that:
the video processing method, the video processing device and the computer storage medium provided by the embodiment of the invention have the advantages that the target identification is carried out on the video frames in the basic video, the set information containing the structured semantic information of each target is generated for each video frame according to the identification result, then each video frame is divided into the corresponding time group according to the time information corresponding to each video frame, and then the structured semantic information corresponding to each video frame in the time group is aggregated to obtain the structured semantic information in the group aiming at the time group. The video processing scheme provided by the invention can convert the image information recorded in the basic video frame into the structured semantic information containing the time information, and simultaneously, the structured semantic information of each target is aggregated in a time period by utilizing the characteristic that the target activity has continuity, so that the intra-group structured semantic information of the time period is obtained. When a user needs to know the video information recorded in the basic video, the information recorded in the basic video can be basically known based on the in-group structured semantic information corresponding to each time period in the basic video. The method greatly reduces the browsing amount of the user, reduces the burden of the user and improves the user experience.
Drawings
Fig. 1 is a flowchart of a video processing method according to an embodiment of the present invention;
fig. 2 is a flowchart of a video processing method according to a second embodiment of the present invention;
fig. 3 is a flowchart of finding structured semantic information for the same target in the aggregate information of each video frame according to the second embodiment of the present invention;
fig. 4 is a schematic diagram illustrating a relationship between behavior positions of objects in a video frame according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating the relationship between the behavior positions of the targets in the next frame adjacent to the video frame of FIG. 4;
fig. 6 is a flowchart of a video processing method according to a third embodiment of the present invention;
fig. 7 is a schematic hardware configuration diagram of a video processing apparatus according to a fourth embodiment of the present invention;
fig. 8 is a schematic hardware configuration diagram of a video processing apparatus according to a fifth embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
The first embodiment is as follows:
in order to solve the problem that, in existing video processing schemes, retaining the video frames with motion information for the user to browse by motion detection still leaves the user browsing a large amount of video and results in poor user experience, the invention provides a video processing method; please refer to the flowchart of the video processing method shown in fig. 1:
s102, carrying out target recognition on the video frames in the basic video, and generating set information for each video frame according to the recognition result.
By target is generally meant a monitored object that the user cares about, such as a person or a pet. It should be understood that static objects such as household appliances and furniture do not change their spatial position unless someone moves them, so a user who installs a surveillance camera is not really interested in monitoring the static objects within its range; the purpose is to monitor people, animals and other objects that can act and change position, so that their activities can be learned from the recorded video. In this embodiment, the video processing apparatus may adopt intelligent video analysis methods when recognizing targets in a video frame, such as the traditional feature-descriptor-plus-classifier approach or deep learning methods.
The set information of a video frame comprises the structured semantic information of each target in the video frame. Structured semantic information is target-related information arranged in a preset order; for example, in this embodiment it comprises the target information, the target behavior information and the time information of the moment at which the target performs the behavior, and the time information, target information and target behavior information are always arranged in a fixed order. For the j-th target in the i-th frame, the structured semantic information may be characterized as:
SM_{i,j} = [time information, target information, target behavior information]^T
It should be understood that the time information in the structured semantic information of the j-th target in the i-th frame is simply the time information of the i-th video frame. Of course, the video processing apparatus may also arrange the three items in another order, but once the order of the structured semantic information of the j-th target in the i-th frame is adjusted, the structured semantic information of all targets in all video frames of the base video must be adjusted in the same way, so that the structured semantic information of every target in every video frame follows the same convention, which facilitates subsequent processing.
In this embodiment, the total number of targets in one video frame is referred to as the "intra-frame number". Assume the intra-frame number of the i-th frame is J, where J may be greater than or equal to 1. The set information of the i-th frame is then:
SM_i = {SM_{i,j} | 0 < j ≤ J}
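Purely as an illustration (not part of the embodiment), the sketch below shows one possible Python representation of the per-target structured semantic information SM_{i,j} and the per-frame set information SM_i described above; the class, field names and example codes are assumptions, not defined by the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StructuredSemanticInfo:
    """One SM_{i,j} entry: fields kept in a fixed, preset order."""
    time_info: float   # time (or frame number) of the i-th frame
    target_info: int   # quantized target category, e.g. 0 = "person"
    behavior_info: int # quantized behavior category, e.g. 4 = "watching TV"

# Set information SM_i of one frame: the structured semantic
# information of all J targets recognized in that frame.
FrameSetInfo = List[StructuredSemanticInfo]

# Example: frame i contains two targets.
frame_set_info: FrameSetInfo = [
    StructuredSemanticInfo(time_info=10.0, target_info=0, behavior_info=4),
    StructuredSemanticInfo(time_info=10.0, target_info=1, behavior_info=3),
]
```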
in an example of the embodiment, the video processing apparatus may directly use the original video captured by the surveillance camera as the base video. However, considering that the original video contains a large number of video frames in which no target appears, in this embodiment the video processing apparatus does not use the original video directly as the base video; it first obtains the base video from the original video before performing target recognition on the video frames of the base video.
The video processing apparatus obtains the base video by screening the original video, mainly by removing the video frames of the original video that contain no motion information. The video processing apparatus therefore detects motion information for each video frame of the original video and then screens out the frames that contain no motion information, leaving the base video. When detecting motion information, the video processing apparatus may use at least one of an optical flow method, an inter-frame difference method and a background difference method:
the optical flow (optical flow) method is an instantaneous velocity of a pixel movement of a spatially moving object on an observation imaging surface. When the object moves, the brightness mode of the corresponding point on the image also moves correspondingly, and the apparent motion of the brightness mode of the image is the optical flow. The optical flow study uses the temporal variation and correlation of intensity data of pixels in an image sequence to determine the "motion" of the respective pixel location. The optical flow expresses the change of the image and can therefore be used by the observer to determine the movement of the object. In general, optical flow results from camera motion, object motion in a scene, or the common motion of both. The optical flow method detects a moving object, and the basic idea is to give a velocity vector to each pixel point in an image, so that a motion field of the image is formed. The points on the image and the points on the three-dimensional object are in one-to-one correspondence at a certain specific motion moment, and the image is dynamically analyzed according to the speed vector characteristics of each pixel point. If no moving object exists in the image, the optical flow vector is continuously changed in the whole image area, and when the object and the image background have relative motion, the speed vector formed by the moving object is necessarily different from the speed vector of the neighborhood background, so that the position of the moving object is detected.
The inter-frame difference method detects moving targets by performing a difference ("subtraction") operation on adjacent video frames of an image sequence, exploiting the strong correlation between adjacent frames to detect change. It extracts the motion region of the sequence by directly comparing the gray values of corresponding pixels in adjacent video frames and then applying a threshold.
The background difference method is a commonly used moving-target detection method when the background is static. Its main idea is to subtract a background image, obtained in advance or updated in real time, from the current video frame to obtain a difference image, and then binarize the difference image with a chosen threshold to obtain the moving-target region. The background difference method is simple to operate and can provide complete feature data, but it is particularly sensitive to interference such as weather and illumination changes.
In general, the video processing apparatus may combine the above-described inter-frame difference method with the background difference method to realize the detection of the motion information of the video frame.
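As a hedged illustration of how such screening might be implemented, the following sketch combines an inter-frame difference with a background difference using OpenCV, as suggested above; the threshold values, function names and the exact way the two cues are combined are assumptions for demonstration only.

```python
import cv2

def has_motion(prev_gray, cur_gray, bg_gray, diff_thresh=25, area_thresh=500):
    """Return True if the current frame appears to contain motion information.

    Combines an inter-frame difference (current vs. previous frame) with a
    background difference (current frame vs. background image); the
    thresholds here are illustrative only.
    """
    frame_diff = cv2.absdiff(cur_gray, prev_gray)
    bg_diff = cv2.absdiff(cur_gray, bg_gray)
    # Binarize both difference images and require agreement of the two cues.
    _, frame_mask = cv2.threshold(frame_diff, diff_thresh, 255, cv2.THRESH_BINARY)
    _, bg_mask = cv2.threshold(bg_diff, diff_thresh, 255, cv2.THRESH_BINARY)
    motion_mask = cv2.bitwise_and(frame_mask, bg_mask)
    return cv2.countNonZero(motion_mask) > area_thresh

def screen_base_video(frames, bg_gray):
    """Keep only the frames that contain motion information (the base video)."""
    base_video = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        cur = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if has_motion(prev, cur, bg_gray):
            base_video.append(frame)
        prev = cur
    return base_video
```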
And S104, dividing each video frame into corresponding time groups according to the time information corresponding to each video frame.
A target usually keeps doing one thing for a while, especially a person: watching television or reading a book, for example, is not finished in an instant but lasts for a considerable time. Thus, within the monitoring range of the camera, if a person sits watching television, that person will appear seated facing the television in many consecutive video frames. A user who wants to know the content of the base video does not need to know the state of the monitored target in every single video frame, because the time covered by one video frame is hardly perceptible: due to the physiology of the human eye, a sequence of pictures is perceived as continuous when its frame rate exceeds about 16 frames per second, a phenomenon known as persistence of vision. Typical video frame rates are also higher than 16, so the time corresponding to each video frame is less than 1/16 of a second. A target's behavior will not start and end within such a short time, so telling the user what the target does in each individual frame is meaningless; what the user really wants to know is what the monitored target did over a longer period. Ideally, therefore, the video processing apparatus should summarize for the user what the monitored target did within the time range covered by the base video.
In this embodiment, the video processing apparatus divides each video frame into a corresponding time group according to the time information of that frame, so that structured semantic information aggregation can subsequently be performed per time group. Since the subsequent aggregation takes a whole time group as its object, the video processing apparatus should ensure that the target behaviors of the video frames placed in one time group are reasonably consistent. In an example of this embodiment, the video processing apparatus divides the video frames of the base video into time groups according to a preset duration and the time information of each frame; that is, the time information of the frames placed in the same time group differs by no more than the preset duration, so the time difference between the start frame and the end frame of each time group does not exceed the preset duration.
Video processing experience suggests that a target behavior typically lasts on the order of 15 minutes, so in this example the preset duration is assumed to be 15 minutes; that is, the video processing apparatus opens a new time group every 15 minutes. Assuming the base video starts at 10:00 a.m., the first time group covers 10:00-10:15. The video processing apparatus places all video frames whose time information lies between 10:00 and 10:15 into the first time group; similarly, all video frames whose time information lies between 10:15 and 10:30 are placed into the second time group, and so on, until every video frame of the base video has been assigned to its corresponding group.
It is assumed that, through the division by the video processing apparatus, the base video is divided into K packets, where the kth time group can be represented as:
SMSegk={SMii belongs to the kth time group
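A minimal sketch of this time-group division, assuming each frame carries a timestamp in seconds and the preset duration is 15 minutes; the names are illustrative only.

```python
from collections import defaultdict

def divide_into_time_groups(frame_set_infos, frame_times, group_seconds=15 * 60):
    """Divide per-frame set information into time groups.

    frame_set_infos : list of set information, one entry per base-video frame
    frame_times     : list of timestamps (seconds) for the same frames
    group_seconds   : preset duration of one time group (15 minutes by default)
    """
    start_time = frame_times[0]
    groups = defaultdict(list)  # k -> SMSeg_k
    for set_info, t in zip(frame_set_infos, frame_times):
        k = int((t - start_time) // group_seconds)
        groups[k].append(set_info)
    return groups

# Example: frames time-stamped between 10:00 and 10:15 land in the first
# group, frames between 10:15 and 10:30 land in the second group, and so on.
```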
In other examples of this embodiment, the preset duration need not be 15 minutes; in some examples its length can even be customized by the user, who can choose it according to how active the monitored target is. If the monitored target is quiet, the preset duration can be set longer, because a quiet target may keep performing the same behavior for a long time; if the monitored target is active, the preset duration should be set somewhat shorter, because an active target may perform several different activities within the same period. Suppose a first user and a second user each install a surveillance camera at home to monitor a pet cat and a pet dog respectively. Cats are quiet and dogs are active, so the first user will probably choose a large preset duration, letting the video processing apparatus aggregate the structured semantic information of the base video as simply as possible and reducing the time the first user spends learning the content of the base video; the second user should set a smaller preset duration in order to learn, as completely as possible, the various behaviors of the pet dog during the monitored period.
S106, aggregating the structured semantic information corresponding to the video frames in the time group to obtain intra-group structured semantic information for the time group.
After the video processing device divides each video frame into corresponding time groups according to the time information corresponding to each video frame, the video processing device performs structured semantic information aggregation on each time group, so as to obtain the intra-group structured semantic information on each time group.
In an example of the embodiment, the video processing apparatus could directly aggregate the structured semantic information of the time groups obtained by dividing the base video. Ideally the intra-frame number, i.e. the total number of targets, of every video frame within a preset duration should be the same. In practice, however, a target may be occluded and not captured by the camera, and the video processing apparatus may make recognition errors in some frames. Since occlusion and recognition errors are the minority of cases, in this embodiment, before aggregating a time group, the video processing apparatus may first remove the video frames in that group that would easily make the aggregation result inaccurate. This embodiment provides two ways of screening the video frames of a time group:
first, the video processing apparatus calculates an intra-group average of the number of frames of each video frame within a time group, retains the video frames whose number within the frames is closest to the intra-group average, and screens out the remaining video frames. For example, assuming that the original in a certain time group contains 22500 video frames, of which 21000 video frames contain 3 objects, 1000 video frames in the remaining 1500 video frames contain 2 objects, and 500 video frames contain 1 object, the intra-group mean value
Avg=(21000*3+1000*2+500*1)/22500≈2.91
Therefore, the intra-group mean of the time group is 2.91, and the number of frames closest to the intra-group mean is 3, so the video processing apparatus retains 21000 video frames with the number of frames of 3, and screens out 1500 video frames with the number of frames of 2 and 1, respectively.
Second, the video processing apparatus retains the video frames whose intra-frame number equals the intra-group high-frequency number and screens out the rest, where the intra-frame number is the number of targets contained in a video frame and the intra-group high-frequency number is the intra-frame number that occurs most often within the time group. For the time group above, the most frequent intra-frame number is 3, so the video processing apparatus again retains the 21000 video frames whose intra-frame number is 3 and screens out the 1500 video frames whose intra-frame numbers are 2 and 1.
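The two screening criteria can be sketched as follows; this is an illustrative implementation under the assumption that each frame's set information is a list with one entry per target, not the patent's own code.

```python
from collections import Counter

def screen_time_group(group_frames):
    """Drop frames whose intra-frame target count is likely unreliable.

    group_frames: list of per-frame set information (each a list of targets).
    Both criteria from the text are computed; the mode-based one is applied.
    """
    counts = [len(frame) for frame in group_frames]

    # Criterion 1: the intra-frame number closest to the intra-group mean.
    avg = sum(counts) / len(counts)                          # e.g. 2.91
    closest = min(set(counts), key=lambda c: abs(c - avg))   # e.g. 3

    # Criterion 2: the intra-group high-frequency number (the mode).
    mode = Counter(counts).most_common(1)[0][0]              # e.g. 3

    keep = mode  # either criterion could be used here
    return [frame for frame in group_frames if len(frame) == keep]
```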
After screening out, in each time group, the video frames that would easily bias the aggregation, the video processing apparatus may aggregate the structured information of the remaining video frames in each group. Once the intra-group structured semantic information of a time group has been obtained by aggregation, the user can learn directly from it what behavior the monitored targets performed during the time covered by that group. For example, assuming the intra-frame number of a certain time group k is J, the intra-group structured semantic information of the k-th time group is
{ [target information of the j-th target, target behavior information of the j-th target]^T | j ∈ [1, J] };
The aggregated intra-group structured semantic information may not itself contain time information, but in order for the user to understand which time period it refers to, the video processing apparatus must additionally show the user the correspondence between each time group and the actual time; for example, the first time group corresponds to 10:00-10:15 a.m. After seeing the intra-group structured semantic information of the first time group, the user then knows that it describes the monitored content from 10:00 to 10:15 a.m. In another example of this embodiment, the intra-group structured semantic information obtained by aggregation may itself contain the time information of the time group.
In addition, in an example of this embodiment, what the video processing apparatus presents to the user directly is not the intra-group structured semantic information itself; instead it derives description summary information from the intra-group structured semantic information, converting the standard structured language into a text description and/or a picture description according to the content of each part of the intra-group structured semantic information. For example, the converted description summary information may be the text: from 10:00 to 10:15 a.m., the first target was doing X, the second target was doing Y and the third target was doing Z.
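As a sketch only, the conversion from intra-group structured semantic information to a text summary might look like the following; the lookup tables, codes and phrasing are assumptions.

```python
# Hypothetical lookup tables mapping quantized codes back to readable labels.
TARGET_NAMES = {0: "person", 1: "cat", 2: "dog"}
BEHAVIOR_NAMES = {3: "eating", 4: "watching TV", 5: "playing with an electronic device"}

def summarize_group(group_info, start_label, end_label):
    """Turn intra-group structured semantic information into a text summary.

    group_info: list of (target_code, behavior_code) pairs, one per target.
    """
    parts = []
    for idx, (target_code, behavior_code) in enumerate(group_info, start=1):
        target = TARGET_NAMES.get(target_code, "unknown target")
        behavior = BEHAVIOR_NAMES.get(behavior_code, "doing an unknown activity")
        parts.append(f"target {idx} ({target}) is {behavior}")
    return f"From {start_label} to {end_label}: " + "; ".join(parts) + "."

# e.g. summarize_group([(0, 4), (0, 3)], "10:00", "10:15")
# -> "From 10:00 to 10:15: target 1 (person) is watching TV;
#     target 2 (person) is eating."
```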
In the video processing method provided by this embodiment of the invention, target recognition is performed on each video frame of the base video to generate, for each frame, set information containing the structured semantic information of every target in the frame; the video frames are then divided by time, and finally the structured semantic information of each resulting time group is aggregated to obtain the intra-group structured semantic information of that group. The user can thus learn, directly from the intra-group structured semantic information, the main information captured by the surveillance camera during the corresponding time period, avoiding the need to spend a large amount of time browsing the base video. Moreover, when acquiring the base video, the video processing apparatus can screen out the video frames of the original video that contain no motion information, which reduces the amount of subsequent processing, improves processing efficiency and reduces the waste of processing resources.
On the other hand, after obtaining the intra-group structured semantic information, the video processing apparatus generates description summary information from it, so that the user can learn more intuitively, through the description summary information, what happened in each time period, which improves the user experience.
Example two:
this embodiment further describes the video processing method provided by the invention, and in particular describes the aggregation of structured semantic information in detail; please refer to the flowchart of the video processing method shown in fig. 2:
s202, detecting motion information of each video frame in the original video, and screening out video frames which do not contain the motion information in the original video to obtain a basic video.
Since the reason and the manner for screening out the video frames that do not include motion information are explained in more detail in the first embodiment, they are not described herein again.
And S204, carrying out target identification on the video frames in the basic video, and generating set information for each video frame according to the identification result.
Similar to the first embodiment, the set information of a video frame comprises the structured semantic information of each target in the frame. In this embodiment the structured semantic information still comprises time information, target information and target behavior information, where the target information comprises the target category, such as "person" or "animal". The target behavior information comprises the target behavior position and the behavior category. Behavior categories include "playing with a mobile phone", "watching TV", "reading", "eating" and so on, and they may be defined by the user according to the place the base video comes from: for a base video captured in a living room the behavior categories may include "playing with a mobile phone", "watching TV", "reading" and "eating", while for a base video from a dining room they may be "eating", "clearing the table" and so on. The target behavior position is the coordinate position at which the corresponding behavior occurs. The video processing apparatus needs to obtain all four kinds of information when performing target recognition on a video frame.
In an example of the embodiment, the video processing apparatus quantizes the target information and the target behavior information in the structured semantic information; for example, "person" is quantized to "0", "cat" to "1", "dog" to "2", and so on. Thus "0", "1" and "2" are the standard target information of the three target categories "person", "cat" and "dog" respectively. Similarly, the video processing apparatus represents "eating", "watching TV" and "playing with electronic devices" by the integers "3", "4" and "5", so that "3", "4" and "5" are the standard behavior categories of those three behaviors.
For the structured semantic information of a given target in a given video frame, the video processing apparatus may use a four-dimensional vector; for example, the structured semantic information of the j-th target in the i-th frame is:
SM_{i,j} = [FrNo_i, Pos_{i,j}, Type_{i,j}, Action_{i,j}]^T
where FrNo_i denotes the frame number of the i-th frame; it should be understood that the frame number can characterize the time information of the i-th video frame. Pos_{i,j} denotes the behavior position of the j-th target in the i-th frame, Type_{i,j} denotes the target category of the j-th target in the i-th frame, and Action_{i,j} denotes the behavior category of the j-th target in the i-th frame.
Assuming the total number of targets contained in the i-th frame, i.e. the intra-frame number, is J, the set information of the i-th frame can be represented by the set of structured semantic information of all targets in the i-th frame:
SM_i = {SM_{i,j} | 0 < j ≤ J}
And S206, dividing each video frame into corresponding time groups according to the time information corresponding to each video frame.
After obtaining the set information of every video frame of the base video, the video processing apparatus divides the video frames into corresponding time groups according to the preset duration. It should be understood that dividing the video frames into time groups means dividing the structured semantic information corresponding to those frames into the time groups. Assuming the division by the video processing apparatus yields K time groups in total, the k-th time group can be expressed as
SMSeg_k = {SM_i | i belongs to the k-th time group}
And S208, screening out, in each time group, the video frames that can cause inaccurate aggregation results.
After dividing each video frame of the base video into its corresponding time group, the video processing apparatus may screen out the video frames in each time group that may cause inaccurate aggregation results, using either of the two ways described in the first embodiment.
S210, aggregating the structured semantic information corresponding to the video frames in the time group to obtain intra-group structured semantic information for the time group.
Within a time group, the set information of each video frame is in fact the set of structured semantic information of the targets in that frame; SMSeg_k is therefore a collection of the structured semantic information of each target in each video frame over a period of time. The aggregation is really about obtaining structured semantic information per target within the time group, for example concluding by aggregation that target X was doing S at position P during period T. The key to aggregating the structured semantic information of the video frames in a time group is therefore to find, across the set information of the individual frames, the structured semantic information that refers to the same target, and then to aggregate that information to obtain the intra-group structured semantic information for that target.
Therefore, in order to aggregate the structured semantic information of each video frame in the kth time group, in this embodiment, the video processing apparatus needs to first find the structured semantic information for the same target in the set information of each video frame, and then aggregate the structured semantic information for the same target to obtain the intra-group structured semantic information for the target.
If each video frame of the k-th time group contains only one target, the situation is very simple: the set information of each frame consists solely of the structured semantic information of that unique target, so the video processing apparatus can directly take the unique structured semantic information in each set information as the structured semantic information for that target.
However, if J objects are included in each video frame of the kth time group, and J is greater than or equal to 2, then: the video processing apparatus may find the structured semantic information for the same target in each set information in the manner shown in fig. 3:
s302, for any two adjacent video frames, the video processing apparatus takes the ordering of the structured semantic information in the preceding frame's set information as the standard ordering.
Firstly, suppose the video processing apparatus is currently looking, in the set information of the i-th frame and the (i+1)-th frame of the k-th time group, for the structured semantic information of the same targets, and that every video frame of the k-th time group contains three targets A, B and C. Suppose further that the structured semantic information SM_i corresponding to the i-th video frame lists the targets according to their positions in the frame, from left to right and from top to bottom. In this embodiment, the earlier of the two adjacent video frames is called the "preceding frame" and the other the "succeeding frame"; for example, the i-th video frame (preceding frame) and the (i+1)-th video frame (succeeding frame) are shown in fig. 4 and fig. 5 respectively, where SM_i and SM_{i+1} are the set information of the i-th and (i+1)-th frames:
SM_i = {SM_{i,1}, SM_{i,2}, SM_{i,3}}
SM_{i+1} = {SM_{i+1,1}, SM_{i+1,2}, SM_{i+1,3}}
Here SM_{i,1}, SM_{i,2}, SM_{i,3} are the structured semantic information of A, B and C in the preceding frame, and SM_{i+1,1}, SM_{i+1,2}, SM_{i+1,3} are likewise the structured semantic information of the three targets in the succeeding frame. However, because different video frames correspond to different times and the targets may change position between those times, the video processing apparatus cannot yet determine which of the targets A, B and C each of SM_{i+1,1}, SM_{i+1,2}, SM_{i+1,3} corresponds to. That is, it cannot determine whether target 51 in the succeeding frame is target 41 in the preceding frame, whether target 52 is target 42, or whether target 53 is target 43.
To determine which targets SM_{i+1,1}, SM_{i+1,2}, SM_{i+1,3} correspond to, in this embodiment the video processing apparatus may first take the ordering {SM_{i,1}, SM_{i,2}, SM_{i,3}} of the preceding-frame set information SM_i as the standard ordering.
S304, the video processing apparatus permutes the structured semantic information in the succeeding-frame set information to obtain J! candidate orderings.
For the succeeding frame there are 3! candidate orderings, e.g. {SM_{i+1,1}, SM_{i+1,2}, SM_{i+1,3}}, {SM_{i+1,1}, SM_{i+1,3}, SM_{i+1,2}}, {SM_{i+1,3}, SM_{i+1,2}, SM_{i+1,1}} and so on. It should be appreciated that when a time group contains J targets, there are likewise J! candidate orderings.
S306, according to the behavior positions in the structured semantic information of the preceding and succeeding frames, the video processing apparatus calculates the distance between each pair of behavior positions that share the same sequence number in the candidate ordering and in the standard ordering.
Since the structured semantic information of the preceding frame contains the behavior position of each target, and so does that of the succeeding frame, for each candidate ordering the video processing apparatus can calculate the distance S1 between the first behavior position of the preceding frame and the first behavior position of the succeeding frame, the distance S2 between the second behavior positions, and the distance S3 between the third behavior positions. For a scenario in which the intra-frame number is J, the calculation proceeds in the same way.
S308, the video processing apparatus calculates the sum of the distances between the behavior positions with corresponding sequence numbers in the standard ordering and in the candidate ordering.
For a scenario with an intra-frame number of 3, the video processing apparatus can compute the distance sum S between the targets of each candidate ordering and those of the standard ordering; for example, for the x-th candidate ordering, the distance sum is S_x = S1 + S2 + S3.
S310, the video processing apparatus selects the candidate ordering with the smallest distance sum as the standard ordering of the succeeding-frame set information.
Finally, among the J! candidate orderings, the video processing apparatus selects the one with the smallest distance sum S and takes it as the standard ordering of the structured semantic information of the succeeding frame. For the example above with an intra-frame number of 3, suppose the calculation determines that the first candidate ordering, {SM_{i+1,1}, SM_{i+1,2}, SM_{i+1,3}}, has the smallest distance sum; then SM_{i+1,1}, SM_{i+1,2}, SM_{i+1,3} in the succeeding frame correspond to A, B and C respectively. That is, in both SM_i and SM_{i+1} the first structured semantic information refers to target A, the second to target B and the third to target C. Other video frames are processed similarly.
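A small sketch of this matching step, assuming equal intra-frame numbers in the two adjacent frames and Euclidean distances between behavior positions; it enumerates the J! candidate orderings exactly as described above. Function and variable names are illustrative.

```python
import itertools
import math

def match_targets(prev_positions, next_positions):
    """Pick the ordering of the succeeding frame's targets that matches the
    preceding frame's standard ordering, by minimizing the sum of distances
    between behavior positions with the same sequence number.

    prev_positions / next_positions: lists of (x, y) behavior positions.
    Returns a tuple `order` such that next_positions[order[j]] is matched
    with prev_positions[j].
    """
    best_order, best_sum = None, math.inf
    # J! candidate orderings of the succeeding frame's targets.
    for order in itertools.permutations(range(len(next_positions))):
        dist_sum = sum(
            math.dist(prev_positions[j], next_positions[order[j]])
            for j in range(len(prev_positions))
        )
        if dist_sum < best_sum:
            best_order, best_sum = order, dist_sum
    return best_order

# Example: targets barely move between adjacent frames, so the chosen
# ordering pairs each target with its nearest counterpart.
# match_targets([(1, 2), (5, 5), (9, 1)], [(5, 6), (1, 2), (9, 2)]) -> (1, 0, 2)
```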
After determining which structured semantic information within the time group refers to the same target, the video processing apparatus may aggregate that information to obtain the intra-group structured semantic information for the target. Assume the structured semantic information for the first target within a time group is SM_{1,1}, SM_{2,1} and SM_{3,1} respectively, where
SM_{1,1} = [1, (1, 2), 0, 5]^T
SM_{2,1} = [2, (1, 3), 1, 6]^T
SM_{3,1} = [3, (1, 2), 0, 6]^T
Firstly, the video processing apparatus calculates the average target information and average target behavior information of this target within the time group from the structured semantic information of each video frame. For the target in this example, the average behavior position is ((1+1+1)/3, (2+3+2)/3), i.e. (1, 7/3); the average target category is (0+1+0)/3, i.e. 1/3; and the average target behavior category is (5+6+6)/3, i.e. 17/3.
Then, the video processing apparatus selects, from the preset standard target information and standard target behavior information, the actual target information and actual target behavior information for this target: the actual target information is the standard target information closest to the average target information, and the actual target behavior information is the standard target behavior information closest to the average target behavior information. For example, it was introduced above that the standard target information of the three target categories "person", "cat" and "dog" is "0", "1" and "2", and that the standard target behavior information of the behaviors "eating", "watching TV" and "playing with electronic devices" is "3", "4" and "5", with further behavior categories, if present, numbered sequentially. Through this matching it can finally be determined that the intra-group structured semantic information of this target for the time group is
SM_{actual,1} = [(1, 7/3), 0, 6]^T
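The aggregation just illustrated can be sketched as follows; the standard code sets passed in are assumptions chosen to reproduce the numbers of the example above, not values fixed by the patent.

```python
def aggregate_target(entries, standard_types, standard_behaviors):
    """Aggregate one target's per-frame structured semantic information
    ([frame_no, (x, y), type_code, behavior_code]) into intra-group info.

    standard_types / standard_behaviors: iterables of the preset quantized
    standard codes (e.g. {0, 1, 2} for "person", "cat", "dog").
    """
    n = len(entries)
    avg_x = sum(e[1][0] for e in entries) / n
    avg_y = sum(e[1][1] for e in entries) / n
    avg_type = sum(e[2] for e in entries) / n
    avg_behavior = sum(e[3] for e in entries) / n

    # Snap the averages to the nearest preset standard codes.
    actual_type = min(standard_types, key=lambda t: abs(t - avg_type))
    actual_behavior = min(standard_behaviors, key=lambda b: abs(b - avg_behavior))
    return [(avg_x, avg_y), actual_type, actual_behavior]

# With the example above (behavior codes assumed to extend to 6):
# aggregate_target([[1, (1, 2), 0, 5], [2, (1, 3), 1, 6], [3, (1, 2), 0, 6]],
#                  standard_types={0, 1, 2}, standard_behaviors={3, 4, 5, 6})
# -> [(1.0, 2.333...), 0, 6]
```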
In fact, after obtaining the intra-group structured semantic information, the video processing apparatus may present it to the user directly, while also showing the user the correspondence between the coordinate positions and the functional areas of the home, between each item of standard target information and the corresponding standard target, and between each item of standard target behavior information and the corresponding standard target behavior, so that the user can work out from these correspondences what the intra-group structured semantic information means.
And S212, obtaining description summary information aiming at the time group according to the structured semantic information in the group.
In an example of this embodiment, the video processing apparatus determines the meaning of the intra-group structured semantic information from the actual target information, the actual target behavior information and the correspondences above, obtains the description summary information for the time group, and then shows the description summary information to the user, so that the user can learn what happened during the corresponding time.
The video processing method provided by this embodiment can convert image information recorded in a basic video frame into structured semantic information including time information, and at the same time, aggregate the structured semantic information of each target in a time period by using the characteristic that target activities have persistence to obtain the intra-group structured semantic information of the time period. When a user needs to know the video information recorded in the basic video, the information recorded in the basic video can be basically known based on the in-group structured semantic information corresponding to each time period in the basic video. The method greatly reduces the browsing amount of the user, reduces the burden of the user and improves the user experience.
Example three:
the present embodiment will further describe the video processing method in the foregoing embodiment with reference to specific examples, please refer to fig. 6:
and S602, obtaining a basic video from the original video by adopting an inter-frame difference method.
The base video is the set of video frames of the original video that contain motion information. Here the inter-frame difference method is applied to the luminance component Y of the original video to detect the frames that contain motion information. For example, a threshold T_diff is set from an empirical value; for a given video frame, the mean absolute deviation AvgDiff between that frame and the adjacent preceding video frame (or several preceding frames) is calculated, and if AvgDiff exceeds the threshold T_diff the frame is considered to contain motion information; otherwise it is considered not to.
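A hedged sketch of this luminance-based check; the threshold value and names are illustrative assumptions.

```python
import numpy as np

def contains_motion(cur_y, prev_y_frames, t_diff=10.0):
    """Inter-frame difference check on the luminance (Y) component.

    cur_y         : current frame's Y channel as a 2-D numpy array
    prev_y_frames : list of one or more preceding frames' Y channels
    t_diff        : empirically chosen threshold (illustrative value)

    Returns True when the mean absolute deviation from the preceding
    frame(s) exceeds the threshold, i.e. the frame belongs to the base video.
    """
    avg_diff = np.mean([
        np.mean(np.abs(cur_y.astype(np.float32) - prev_y.astype(np.float32)))
        for prev_y in prev_y_frames
    ])
    return avg_diff > t_diff
```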
S604, detecting the persons in the video with the R-FCN method, recognizing person behavior information and determining behavior positions.
R-FCN (Region-based Fully Convolutional Networks) is a region-based fully convolutional network algorithm. The video processing apparatus in this embodiment detects and recognizes only the persons in a video frame, because the user is interested only in the people appearing in the surveillance video. There is therefore only one target category, "person", represented by the integer 0; if more categories exist, they are numbered sequentially. The behavior categories to be recognized are "eating", "watching TV" and "playing with electronic devices", represented by the integers 0, 1 and 2 respectively; if more behavior types exist, they are likewise numbered sequentially.
Assume that frame 10 has 3 people in total, where the first person coordinate position is (x0, y0) and the behavior is "eat"; the second person coordinate position is (x1, y1) and the behavior is "watch TV"; the third person coordinate position is (x2, y2) and the behavior is "play electronic device"; the set information of the video frame can be expressed as:
SM_{10,0} = [10, (x0, y0), 0, 0]^T
SM_{10,1} = [10, (x1, y1), 0, 1]^T
SM_{10,2} = [10, (x2, y2), 0, 2]^T
similarly, other video frames are described in the same manner.
And S606, carrying out time group division on the basic video of one day according to the preset time length of 1 hour.
Assume the period set by the user is one day; the per-frame person semantic sets of that day are clustered to generate activity event summary information for the day. The specific implementation steps are as follows:
assume that the aggregate information of all video frames of the current day is represented as:
SM={SMii belongs to the current day
First, the video processing apparatus performs time group division on each video frame according to a preset time length of 1 hour, and assuming that K is 16 groups in total, the current-day text semantic set can be expressed as
SM=SMSeg0、SMSeg1、…、SMSeg16
Taking the k-th group as an example, it can be expressed as:
SMSeg5={SMii belongs to the 5 th time group
And S608, aggregating the structured semantic information corresponding to the video frames in each time group.
For each time group, the video processing apparatus computes the intra-group mean of that group and screens out the video frames whose intra-frame number is not the value closest to the intra-group mean.
Subsequently, the video processing apparatus performs structured semantic information aggregation on the remaining video frames in each time group according to the description in embodiment two.
And S610, obtaining description summary information aiming at the time group according to the structured semantic information in the group.
And finally, taking the time group as the unit, a text description of the day's activity events is output according to the aggregated intra-group structured semantic information, including the start time of every group, the number of targets it contains, and the category, activity position and behavior category of each target. For example: "From 11:30 a.m. to 11:45 a.m., a total of 3 people were active: a first person was watching TV on the sofa, a second person was eating at the table, and a third person was playing with an electronic device on the sofa."
The video processing scheme provided by the embodiment can greatly simplify the workload of knowing video information by a user, and enables the user to know video content in a simple and visual mode, thereby improving user experience.
The fourth embodiment:
in this embodiment, a video processing apparatus in the foregoing embodiment is described, please refer to a schematic diagram of a hardware structure of the video processing apparatus shown in fig. 7:
the video processing apparatus 70 includes a processor 71, a memory 72 and a communication bus 73, wherein the communication bus 73 is used for realizing connection communication between the processor 71 and the memory 72, and the memory 72 is a computer-readable storage medium in which at least one computer program is stored, and the computer program can be read, compiled and executed by the processor 71, so as to realize the corresponding processing flow. For example, in the present embodiment, a video processing program is stored in the memory 72, and the processor 71 can implement the video processing method described in the foregoing embodiments by executing the computer program.
First, the processor 71 performs object recognition on the video frames in the base video, and generates set information for each video frame according to the recognition result.
The set information of a video frame comprises the structured semantic information of each target in the video frame. Structured semantic information is target-related information arranged in a preset order; for example, in this embodiment it comprises the target information, the target behavior information and the time information of the moment at which the target performs the behavior, and the time information, target information and target behavior information are always arranged in a fixed order. For the j-th target in the i-th frame, the structured semantic information may be characterized as:
SM_{i,j} = [time information, target information, target behavior information]^T
It should be understood that the time information in the structured semantic information of the j-th target in the i-th frame is simply the time information of the i-th video frame. Of course, the processor 71 may also arrange the three items in another order, but once the order of the structured semantic information of the j-th target in the i-th frame is adjusted, the structured semantic information of all targets in all video frames of the base video must be adjusted in the same way, so that the structured semantic information of every target in every video frame follows the same convention, which facilitates subsequent processing.
In this embodiment, the total number of targets in one video frame is referred to as the "intra-frame number". Assume the intra-frame number of the i-th frame is J, where J may be greater than or equal to 1. The set information of the i-th frame is then:
SM_i = {SM_{i,j} | 0 < j ≤ J}
in an example of the embodiment, the processor 71 may directly use the original video captured by the surveillance camera as the base video. However, considering that the original video contains a large number of video frames in which no target appears, in this embodiment the processor 71 does not use the original video directly as the base video; it obtains the base video from the original video before performing target recognition on the video frames of the base video.
The processor 71 obtains the base video from the original video mainly by screening out the video frames that do not contain motion information. The processor 71 therefore detects motion information in each video frame of the original video, and then screens out the video frames that contain no motion information to obtain the base video. When detecting motion information, the processor 71 may use at least one of an optical flow method, an inter-frame difference method and a background difference method; in general, the processor 71 may combine the inter-frame difference method with the background difference method to detect the motion information of a video frame.
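A minimal sketch of such screening, assuming OpenCV is used and combining an inter-frame difference with a background-subtraction model, might look as follows; the threshold values and all function names other than the OpenCV calls are illustrative assumptions:

```python
import cv2

def has_motion(prev_gray, curr_gray, bg_subtractor, pixel_thresh=25, area_thresh=500):
    """Rough motion test combining the inter-frame difference with background subtraction."""
    # Inter-frame difference: pixels that changed between the two consecutive frames.
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, diff_mask = cv2.threshold(diff, pixel_thresh, 255, cv2.THRESH_BINARY)
    # Background difference: pixels that deviate from the learned background model.
    fg_mask = bg_subtractor.apply(curr_gray)
    # Count a frame as containing motion only if both cues flag enough pixels.
    moving = cv2.bitwise_and(diff_mask, fg_mask)
    return cv2.countNonZero(moving) > area_thresh

def screen_base_video(frames):
    """Drop original-video frames without motion information; the rest form the base video."""
    bg = cv2.createBackgroundSubtractorMOG2()
    base_video, prev_gray = [], None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None and has_motion(prev_gray, gray, bg):
            base_video.append(frame)
        prev_gray = gray
    return base_video
```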
The processor 71 then divides each video frame into a corresponding time group according to the time information corresponding to that video frame, so that structured semantic information aggregation can subsequently be performed on each time group. Since the aggregation is performed per time group after the division, the processor 71 should ensure that the target behaviors corresponding to the video frames divided into one time group are relatively consistent. In an example of this embodiment, the processor 71 divides each video frame in the base video into a corresponding time group according to a preset duration and the time information corresponding to each video frame; that is, the difference between the time information of any two video frames divided into the same time group does not exceed the preset duration, or equivalently, the time difference between the start frame and the end frame of each time group does not exceed the preset duration.
According to video processing experience, a behavior of a target generally lasts about 15 minutes, so in this example the preset duration is assumed to be 15 minutes; that is, the processor 71 sets one time group for every 15 minutes. Assuming that the start time of the base video is 10:00 am, the first time group is 10:00-10:15 am, and the processor 71 divides all video frames whose time information falls between 10:00 and 10:15 into the first time group. Similarly, the processor 71 divides all video frames whose time information falls between 10:15 and 10:30 into the second time group, and so on, until every video frame in the base video has been divided into its corresponding time group.
Assuming that the base video is divided into K time groups by the processor 71, the kth time group can be represented as:
SMSegk = {SMi | i belongs to the kth time group}
In other examples of this embodiment, the preset duration does not have to be set to 15 minutes; in some examples its size can even be customized by the user, who can determine it according to the motion characteristics of the monitored target. For example, if the monitored target is quiet, the preset duration may be set longer, because the monitored target is likely to perform the same behavior for a long time; if the monitored target is active, the user may set the preset duration slightly shorter, because an active monitored target may perform several different behaviors within the same period. Assume that two users, A and B, each install a monitoring camera at home to monitor a pet cat and a pet dog, respectively. Given the habits of cats and dogs, the cat is quiet and the dog is active, so user A is likely to set a larger preset duration, which lets the processor 71 aggregate the structured semantic information of the video frames of the base video as simply as possible and reduces the time user A spends learning the content of the base video; user B needs to set a smaller preset duration in order to fully understand the various behaviors of the pet dog during the monitored period.
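A rough sketch of the grouping step, assuming the time information is available as a per-frame timestamp in seconds and that a new group is opened whenever the preset duration would otherwise be exceeded, might be:

```python
def divide_into_time_groups(frames, preset_duration=15 * 60):
    """frames: list of (timestamp_seconds, set_info) pairs in temporal order.
    Splits them into time groups so that the start and end frames of a group
    differ by at most preset_duration seconds."""
    groups, current, group_start = [], [], None
    for timestamp, set_info in frames:
        if group_start is None or timestamp - group_start > preset_duration:
            if current:
                groups.append(current)
            current, group_start = [], timestamp
        current.append(set_info)
    if current:
        groups.append(current)
    return groups
```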
After the processor 71 divides each video frame into corresponding time groups according to the time information corresponding to each video frame, it performs structured semantic information aggregation for each time group, thereby obtaining the intra-group structured semantic information for each time group.
In an example of this embodiment, the processor 71 may directly perform structured semantic information aggregation on the time groups obtained by dividing the video frames of the base video. Ideally, the intra-frame number of every video frame within the preset duration, i.e., the total number of targets in each video frame, should be consistent. However, this does not account for the situation where a target is occluded and therefore not captured by the monitoring camera, nor for the situation where the processor 71 makes an error when recognizing the targets in a video frame. Since occlusion and recognition errors are rare cases, in this embodiment, before the processor 71 performs structured semantic information aggregation on a time group, it may first remove the video frames in that time group that would easily make the aggregation result inaccurate. This embodiment provides two ways of screening the video frames of a time group:

First, the processor 71 calculates the intra-group mean of the intra-frame numbers of the video frames within the time group, retains the video frames whose intra-frame number is closest to the intra-group mean, and screens out the remaining video frames. For example, assuming that a certain time group originally contains 22500 video frames, of which 21000 video frames contain 3 targets, 1000 of the remaining 1500 video frames contain 2 targets, and 500 video frames contain 1 target, the intra-group mean is

Avg = (21000*3 + 1000*2 + 500*1)/22500 ≈ 2.91

The intra-group mean of this time group is therefore 2.91, and the intra-frame number closest to it is 3, so the processor 71 retains the 21000 video frames whose intra-frame number is 3 and screens out the 1500 video frames whose intra-frame numbers are 2 and 1, respectively.

Second, the processor 71 retains the video frames whose intra-frame number equals the intra-group high-frequency number and screens out the remaining video frames, where the intra-frame number is the number of targets contained in a video frame and the intra-group high-frequency number is the intra-frame number that occurs most often within the time group. For the same time group, the most frequent intra-frame number is 3, so the processor 71 again retains the 21000 video frames whose intra-frame number is 3 and screens out the 1500 video frames whose intra-frame numbers are 2 and 1, respectively.
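Both screening strategies could be sketched roughly as follows, assuming each frame is represented by the list of its per-target structured semantic records, so that the intra-frame number is simply the list length:

```python
from collections import Counter

def filter_by_intra_group_mean(group):
    """Keep the frames whose intra-frame number is closest to the intra-group mean."""
    counts = [len(set_info) for set_info in group]        # targets per frame
    mean = sum(counts) / len(counts)                      # e.g. 2.91 in the example above
    kept = min(set(counts), key=lambda c: abs(c - mean))  # intra-frame number to retain
    return [set_info for set_info in group if len(set_info) == kept]

def filter_by_high_frequency(group):
    """Keep the frames whose intra-frame number equals the most frequent one."""
    kept = Counter(len(set_info) for set_info in group).most_common(1)[0][0]
    return [set_info for set_info in group if len(set_info) == kept]
```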
After the processor 71 filters out the video frames prone to causing aggregation bias in each time group, it can aggregate the structured semantic information of the video frames within each group. Once the intra-group structured semantic information of a time group has been obtained by aggregation, the user can directly learn from it what behavior the monitored targets performed during the time corresponding to that time group. For example, assuming that the intra-frame number of a certain time group k is J, the intra-group structured semantic information of the kth time group is
{[target information of the jth target, target behavior information of the jth target]T | j ∈ [1, J]};
The aggregated intra-group structured semantic information may not include time information. In that case, to let the user know which time period the intra-group structured semantic information refers to, the processor 71 needs to additionally show the user the correspondence between each time group and its specific time, for example, that the first time group corresponds to 10:00-10:15 am. After seeing the intra-group structured semantic information of the first time group, the user can then determine that it describes the monitoring content of 10:00-10:15 am. In another example of this embodiment, the intra-group structured semantic information obtained after aggregation may also include the time information of the time group.
In addition, in an example of this embodiment, what the processor 71 directly presents to the user is not the intra-group structured semantic information itself; instead, it obtains description summary information from the intra-group structured semantic information, converting the standard structured language into a text description and/or a picture description according to the content of each part of the intra-group structured semantic information. For example, the description summary information obtained after conversion is the text: from 10:00 to 10:15 am, a first target is doing event X, a second target is doing event Y, and a third target is doing event Z.
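For illustration, the conversion from intra-group structured semantic information to a textual description summary might be sketched as follows; the category dictionaries and the sentence template are illustrative assumptions only:

```python
# Assumed, illustrative category dictionaries; the real correspondences would come
# from the standard target and behavior information configured on the device.
TARGET_NAMES = {0: "person", 1: "cat", 2: "dog"}
BEHAVIOR_NAMES = {3: "eating", 4: "watching TV", 5: "playing with an electronic device"}

def describe_group(time_label, intra_group_info):
    """Render intra-group structured semantic information (a list of
    (target_code, behavior_code) pairs) as a one-sentence text summary."""
    parts = [
        f"target {idx} ({TARGET_NAMES.get(t, 'unknown target')}) is "
        f"{BEHAVIOR_NAMES.get(b, 'doing something')}"
        for idx, (t, b) in enumerate(intra_group_info, start=1)
    ]
    return f"{time_label}: " + ", ".join(parts) + "."

print(describe_group("10:00-10:15 am", [(0, 4), (0, 3)]))
# 10:00-10:15 am: target 1 (person) is watching TV, target 2 (person) is eating.
```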
The video processing device provided by this embodiment of the invention performs target recognition on each video frame in the base video and generates, for each video frame, set information containing the structured semantic information of each target in that frame; it then divides the video frames by time, and finally performs structured semantic information aggregation on each time group obtained by the division, thereby obtaining the intra-group structured semantic information for each time group. The user can thus directly learn, from the intra-group structured semantic information, the main information collected by the monitoring camera in the corresponding time period, avoiding the need to spend a large amount of time browsing the base video. Meanwhile, when the processor acquires the base video, it can screen out the video frames in the original video that contain no motion information, which reduces the amount of subsequent processing, improves processing efficiency, and reduces the waste of processing resources.
On the other hand, after the processor obtains the intra-group structured semantic information, it generates the description summary information from it, so that the user can learn more intuitively, through the description summary information, what happened in each time period, which improves user experience.
The fifth embodiment:
the present embodiment will further describe the video processing apparatus 80 provided by the present invention with reference to fig. 8, and particularly describe the process of aggregating structured semantic information in detail.
The video processing apparatus in this embodiment may be implemented in various forms, for example as a mobile terminal such as a mobile phone, tablet computer, notebook computer, palmtop computer, personal digital assistant (PDA), portable media player (PMP), navigation device, wearable device, smart band or pedometer, or as a fixed terminal such as a digital TV or desktop computer. The following description takes a mobile terminal as an example, and it will be understood by those skilled in the art that, except for elements specifically used for mobile purposes, the construction according to this embodiment of the invention can also be applied to fixed terminals.
In the present embodiment, the video processing apparatus 80 includes a processor 81, a memory 82, and a user input unit 83, a display unit 84. The memory 82 stores therein a video processing program for the processor 81 to read and execute, thereby implementing the video processing method. The user input unit 83 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the user input unit 83 may include a touch panel and other input devices. The display unit 84 is used to display information input by the user or information provided to the user. The Display unit 84 may include a Display panel, which may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
Since the reason and the manner for the processor to screen out the video frames that do not include the motion information are explained in detail in the third embodiment, the process for the processor 81 to obtain the base video from the original video is not repeated here.
Similar to the first embodiment, the set information of a video frame includes the structured semantic information of each target in the video frame. In this embodiment, the structured semantic information still includes time information, target information and target behavior information. The target information includes the target category, such as "person" or "animal"; the target behavior information includes the target behavior position and the behavior category. The behavior category includes "playing with a mobile phone", "watching television", "reading", "eating" and the like, and may be defined and set by the user according to the place corresponding to the base video: for a base video collected in a living room, the behavior categories may include "playing with a mobile phone", "watching television", "reading" and "eating", while for a base video originating from a dining room, the behavior categories may be "eating", "clearing the table" and the like. The target behavior position refers to the coordinate position where the corresponding behavior occurs. The processor 81 needs to obtain the above four kinds of information when performing target recognition on a video frame.
In an example of this embodiment, the processor 81 quantizes the target information and the target behavior information in the structured semantic information: for example, "person" is quantized to "0", "cat" to "1", "dog" to "2", and so on, so that "0", "1" and "2" are the standard target information of the three target categories "person", "cat" and "dog", respectively. Similarly, the processor 81 quantizes "eating", "watching television" and "playing with an electronic device" to the integers "3", "4" and "5", respectively, so that "3", "4" and "5" are the standard behavior categories of those three behavior categories.
For the structured semantic information of a certain target in a certain video frame, the processor 81 may use a four-dimensional vector representation; for example, the structured semantic information of the jth target in the ith frame is:
SMi,j=[FrNoi,Posi,j,Typei,j,Actioni,j]T
where FrNoi indicates the frame number of the ith frame (it should be understood that this frame number can characterize the time information of the ith video frame), Posi,j indicates the behavior position of the jth target in the ith frame, Typei,j indicates the target category of the jth target in the ith frame, and Actioni,j indicates the behavior category of the jth target in the ith frame.
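For illustration, the four-dimensional vector and the quantized standard codes from the examples above might be represented roughly as follows; the class and field names, and the concrete code tables, are assumptions for this sketch:

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed standard codes, taken from the quantization example above.
STANDARD_TARGET_INFO = {"person": 0, "cat": 1, "dog": 2}
STANDARD_BEHAVIOR_INFO = {"eat": 3, "watch TV": 4, "play electronic device": 5}

@dataclass
class SM:
    """SM_{i,j} = [FrNo_i, Pos_{i,j}, Type_{i,j}, Action_{i,j}]^T"""
    fr_no: int                 # frame number of the ith frame (carries the time information)
    pos: Tuple[float, float]   # behavior position of the jth target in the ith frame
    type_: int                 # quantized target category of the jth target
    action: int                # quantized behavior category of the jth target

sm_example = SM(fr_no=1, pos=(1.0, 2.0),
                type_=STANDARD_TARGET_INFO["person"],
                action=STANDARD_BEHAVIOR_INFO["play electronic device"])
```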
Assuming that the total number of targets contained in the ith frame, i.e., the intra-frame number, is J, the set information of the ith frame can be represented by the set of structured semantic information of all targets in the ith frame:
SMi = {SMi,j | 0 < j ≤ J}
After the processor 81 obtains the set information of each video frame in the base video, the video frames are divided into corresponding time groups according to a preset duration. It should be understood that dividing the video frames into corresponding time groups means dividing the structured semantic information corresponding to those video frames into the corresponding time groups. In this embodiment, the preset duration may be a custom setting input by the user through the user input unit 83.
Assuming that a total of K time groups are obtained through the division by the processor 81, the kth time group can be represented as
SMSegk = {SMi | i belongs to the kth time group}
After the processor 81 divides each video frame in the base video into its corresponding time group, the processor 81 first filters out, in either of the two ways described in the first embodiment, the video frames in each time group that may cause the aggregation result to be inaccurate.
Within a time group, the set information of each video frame is actually the set of the structured semantic information of each target in that video frame, so SMSegk is actually a collection of the structured semantic information of each target in each video frame over a period of time. The aggregation process essentially obtains, for each target in the time group, its aggregated structured semantic information, for example concluding that target X was doing S at position P during time period T. Therefore, the key to aggregating the structured semantic information corresponding to the video frames of a time group is to find, in the set information of each video frame, the structured semantic information that refers to the same target, and then aggregate the structured semantic information of that same target to obtain the intra-group structured semantic information for the target.
Therefore, in order to aggregate the structured semantic information of each video frame in the kth time group, in this embodiment, the processor 81 needs to first find the structured semantic information for the same target in the set information of each video frame, and then aggregate the structured semantic information for the same target to obtain the intra-group structured semantic information for the target.
If there is only one target in each video frame of the kth time group, this is very simple: the set information of each video frame consists of only the structured semantic information of that unique target, so the processor 81 can directly take the unique structured semantic information in each piece of set information as the structured semantic information for the target.
However, if each video frame of the kth time group contains J targets, where J is greater than or equal to 2, the processor 81 may find the structured semantic information for the same target in each piece of set information as follows:
For any two adjacent video frames, the processor 81 takes the arrangement order of the structured semantic information in the set information of the earlier frame as the standard order. Assume that the processor 81 is currently searching for the structured semantic information of the same target in the ith frame and the (i+1)th frame of the kth time group, and that there are three targets A, B and C in each video frame of the kth time group; in addition, assume that when the set information SMi corresponding to the ith video frame was generated, the targets were described from left to right and from top to bottom according to their positions in the video frame. In this embodiment, the earlier of the two adjacent video frames is referred to as the "previous frame" and the other as the "following frame"; for example, the ith video frame (previous frame) and the (i+1)th video frame (following frame) are shown in fig. 4 and fig. 5, respectively, where SMi and SMi+1 are the set information of the ith frame and the (i+1)th frame, respectively:
SMi={SMi,1,SMi,2,SMi,3}
SMi+1={SMi+1,1,SMi+1,2,SMi+1,3}
Here SMi,1, SMi,2 and SMi,3 are the structured semantic information of A, B and C in the previous frame, respectively, and SMi+1,1, SMi+1,2 and SMi+1,3 are likewise the structured semantic information of the three targets in the following frame. However, because different video frames correspond to different times, a target may have changed position, so at this point the processor 81 cannot determine which of the targets A, B and C each of SMi+1,1, SMi+1,2 and SMi+1,3 corresponds to. That is, the processor 81 cannot determine whether the target 51 in the following frame is the target 41 in the previous frame, whether the target 52 in the following frame is the target 42 in the previous frame, or whether the target 53 in the following frame is the target 43 in the previous frame.
To determine which target each of SMi+1,1, SMi+1,2 and SMi+1,3 corresponds to, in this embodiment the processor 81 may first take the ordering of {SMi,1, SMi,2, SMi,3} in the previous-frame set information SMi as the standard ordering.
Then, the processor 81 sorts the structured semantic information in the following-frame set information to obtain J! candidate orderings.
For the following frame there are therefore 3! candidate orderings, e.g. {SMi+1,1, SMi+1,2, SMi+1,3}, {SMi+1,1, SMi+1,3, SMi+1,2}, {SMi+1,3, SMi+1,2, SMi+1,1}, and so on. It should be appreciated that when a time group contains J targets, there are J! candidate orderings.
Since the structured semantic information of the previous frame includes the behavior position of each target, and the structured semantic information of the following frame also includes the behavior position of each target, for each candidate ordering the processor 81 may calculate the distance S1 between the first behavior position in the previous frame and the first behavior position in the following frame, the distance S2 between the second behavior position in the previous frame and the second behavior position in the following frame, and the distance S3 between the third behavior position in the previous frame and the third behavior position in the following frame. For a scenario where the intra-frame number is J, the calculation proceeds analogously.
Subsequently, the processor 81 calculates, for each candidate ordering, the sum of the distances between its behavior positions and the behavior positions with the same sequence numbers in the standard ordering. For example, for a scenario with an intra-frame number of 3, the processor 81 calculates the distance S between each target in a candidate ordering and the corresponding target in the standard ordering; for the xth candidate ordering, the distance sum is Sx = S1 + S2 + S3.
Finally, the processor 81 selects, from the J! candidate orderings, the one with the smallest distance sum and takes it as the standard ordering of the structured semantic information in the following frame. For example, in the above example with an intra-frame number of 3, the final calculation determines that the first candidate ordering, {SMi+1,1, SMi+1,2, SMi+1,3}, has the smallest distance sum; that is, SMi+1,1, SMi+1,2 and SMi+1,3 in the following frame correspond to A, B and C, respectively. In other words, in both SMi and SMi+1, the first structured semantic information is for target A, the second for target B, and the third for target C. The processing for other video frames is similar.
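A brute-force sketch of this matching step, reusing the SM record from the sketch above and assuming Euclidean distance between behavior positions, might be:

```python
from itertools import permutations
import math

def match_following_frame(prev_frame, next_frame):
    """Reorder next_frame (a list of SM records) so that its jth entry refers to the
    same target as the jth entry of prev_frame, by enumerating all J! candidate
    orderings and keeping the one with the smallest sum of position distances."""
    best_order, best_sum = None, math.inf
    for candidate in permutations(next_frame):
        dist_sum = sum(math.dist(p.pos, q.pos)          # S1 + S2 + ... for this candidate
                       for p, q in zip(prev_frame, candidate))
        if dist_sum < best_sum:
            best_sum, best_order = dist_sum, list(candidate)
    return best_order
```

Enumerating all J! candidate orderings is only practical for the small intra-frame numbers typical of a home monitoring scene; for larger J, an assignment algorithm would likely be preferable.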
After determining the structured semantic information for the same target within the time group, the processor 81 may aggregate the structured semantic information for that target to obtain the intra-group structured semantic information for the target. Assume that the structured semantic information for the first target within a time group is SM1,1, SM2,1 and SM3,1, respectively, where
SM1,1=[1,(1,2),0,5]T
SM2,1=[2,(1,3),1,6]T
SM3,1=[3,(1,2),0,6]T
First, the processor 81 calculates the average target information and the average target behavior information of the target within the time group according to the structured semantic information corresponding to each video frame. For the target in this example, the processor 81 determines the average behavior position as ((1+1+1)/3, (2+3+2)/3), i.e., (1, 7/3); the average target category as (0+1+0)/3, i.e., 1/3; and the average target behavior category as (5+6+6)/3, i.e., 17/3.
Then, the processor 81 selects the actual target information and the actual target behavior information for the target from the preset standard target information and standard target behavior information, where the actual target information is the standard target information closest to the average target information, and the actual target behavior information is the standard target behavior information closest to the average target behavior information. For example, the standard target information of the three target categories "person", "cat" and "dog" has been introduced as "0", "1" and "2", and the standard target behavior information of the three behaviors "eating", "watching television" and "playing with an electronic device" as "3", "4" and "5", so that through matching the intra-group structured semantic information of this target for the time group is finally obtained as
SMactual,1 = [(1, 7/3), 0, 6]T
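A rough sketch of this aggregation step, reusing the SM record above and assuming the standard codes are plain integers supplied by the caller, might be:

```python
def aggregate_target(sm_records, standard_types, standard_actions):
    """Aggregate the structured semantic information of one target over a time group:
    average the position, target and behavior codes, then snap the target and
    behavior codes to the nearest preset standard codes."""
    n = len(sm_records)
    avg_pos = (sum(sm.pos[0] for sm in sm_records) / n,
               sum(sm.pos[1] for sm in sm_records) / n)
    avg_type = sum(sm.type_ for sm in sm_records) / n
    avg_action = sum(sm.action for sm in sm_records) / n
    actual_type = min(standard_types, key=lambda c: abs(c - avg_type))
    actual_action = min(standard_actions, key=lambda c: abs(c - avg_action))
    # For the example above, the averages are (1, 7/3), 1/3 and 17/3; the average
    # target category 1/3 snaps to the standard code 0 ("person").
    return avg_pos, actual_type, actual_action
```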
In fact, after obtaining the intra-group structured semantic information, the processor 81 may directly present it to the user through the display unit 84, while also showing the user the correspondence between each coordinate position and each functional area of the home, the correspondence between each piece of standard target information and the corresponding standard target, and the correspondence between each piece of standard target behavior information and the corresponding standard target behavior, so that the user can work out the meaning represented by the intra-group structured semantic information from these correspondences.
In an example of this embodiment, the processor 81 determines the meaning of the structured semantic information in the group according to the actual target information, the actual target behavior information and the relevant corresponding relationship, obtains the description summary information for the time group, and then shows the description summary information to the user through the display unit 84, so that the user can know what happens in the corresponding time.
The video processing apparatus provided by this embodiment can convert the image information recorded in the frames of the base video into structured semantic information that includes time information, and, exploiting the fact that target activities are persistent, aggregate the structured semantic information of each target over a time period to obtain the intra-group structured semantic information for that period. When a user needs to learn the video information recorded in the base video, the recorded information can be largely understood from the intra-group structured semantic information corresponding to each time period. This greatly reduces the amount of browsing required, lightens the user's burden, and improves user experience.
The present invention also provides a computer-readable storage medium that can store a video processing program for a processor to read, compile and execute, thereby implementing the video processing method described in the foregoing embodiments.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented in a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented in program code executable by a computing device, such that they may be stored on a computer storage medium (ROM/RAM, magnetic disk, optical disk) and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a more detailed description of embodiments of the present invention, and the present invention is not to be considered limited to such descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (9)
1. A video processing method, comprising:
carrying out target identification on video frames in a basic video, and generating set information for each video frame according to an identification result, wherein the set information comprises structured semantic information of each target in the video frames, and the structured semantic information comprises time information, target behavior information and behavior positions of the targets which are arranged according to a preset sequence;
dividing each video frame into corresponding time groups according to the time information corresponding to each video frame;
aggregating the structured semantic information corresponding to each video frame in the time group to obtain the intra-group structured semantic information aiming at the time group; if only one target exists in each video frame of the time group, taking the unique structured semantic information in each set of information as the structured semantic information aiming at the target;
if each video frame of the time group comprises J targets, wherein J is greater than or equal to 2, then:
for any two adjacent video frames, taking the arrangement order of the structured semantic information in the previous-frame set information as a standard order;
sorting the structured semantic information in the following-frame set information to obtain J! candidate orderings;
calculating, according to the behavior positions in the structured semantic information of the previous frame and the following frame, the distances between the behavior positions having the same sequence numbers in each candidate ordering and in the standard order;
calculating, for each candidate ordering, the sum of the distances between its behavior positions and the behavior positions with corresponding sequence numbers in the standard order;
selecting the candidate ordering with the smallest distance sum as the standard order of the following-frame set information, wherein the jth structured semantic information in the following-frame set information and the jth structured semantic information in the previous-frame set information are for the same target, and j is greater than or equal to 1 and less than or equal to J.
2. The video processing method of claim 1, wherein prior to performing object recognition on each video frame in the base video, further comprising:
detecting motion information of each video frame in an original video;
and screening out video frames which do not contain motion information in the original video to obtain the basic video.
3. The video processing method of claim 1, wherein said dividing each of the video frames into corresponding time groups according to the time information corresponding to each of the video frames comprises:
dividing each video frame in the basic video into corresponding time groups according to preset time length set by a user and time information corresponding to each video frame, wherein the time difference between an initial frame and an ending frame in the divided time groups does not exceed the preset time length.
4. The video processing method of claim 1, wherein before aggregating the structured semantic information corresponding to each of the video frames in the temporal group, further comprising:
calculating the intra-group mean of the intra-frame numbers of the video frames in the time group, retaining the video frames whose intra-frame number is closest to the intra-group mean, and screening out the remaining video frames, wherein the intra-frame number is the number of targets contained in each video frame;
or,
retaining the video frames whose intra-frame number equals the intra-group high-frequency number, and screening out the remaining video frames, wherein the intra-frame number is the number of targets contained in each video frame, and the intra-group high-frequency number is the intra-frame number that occurs most frequently within the time group.
5. The video processing method according to any of claims 1-4, wherein said aggregating the structured semantic information corresponding to each of the video frames in the temporal group comprises:
searching structured semantic information aiming at the same target in the set information of each video frame;
and aggregating the structural semantic information aiming at the target to obtain the intra-group structural semantic information aiming at the target.
6. The video processing method of claim 5, wherein the aggregating the structured semantic information for the target to obtain the intra-group structured semantic information for the target comprises:
calculating average target information and average target behavior information of the target in the time group according to the structural semantic information corresponding to each video frame;
selecting actual target information and actual target behavior information for the target from preset standard target information and standard target behavior information, wherein the actual target information and the actual target behavior information are respectively the standard target information and the standard target behavior information which are closest to the average target information and the average target behavior information.
7. The video processing method of claim 6, wherein the object information comprises object classes, the object behavior information comprises behavior classes of objects, and the structured semantic information for a jth object in an ith frame is:
SMi,j=[FrNoi,Posi,j,Typei,j,Actioni,j]T
wherein FrNoi represents the frame number of the ith frame, the frame number being indicative of the time information of the video frame, Posi,j represents the behavior position of the jth target in the ith frame, Typei,j represents the target category of the jth target in the ith frame, and Actioni,j represents the behavior category of the jth target in the ith frame;
the set information of the ith frame is
SMi = {SMi,j | 0 < j ≤ J}
wherein J is the intra-frame number of the ith frame.
8. The video processing method according to any of claims 1 to 4, wherein after aggregating the structured semantic information corresponding to each of the video frames in the time group to obtain the intra-group structured semantic information for the time group, further comprising: and obtaining description summary information aiming at the time group according to the structured semantic information in the group, wherein the description summary information comprises words and/or pictures.
9. A video processing apparatus, comprising a processor, a memory, and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute a video processing program stored in the memory to implement the steps of:
carrying out target identification on video frames in a basic video, and generating set information for each video frame according to an identification result, wherein the set information comprises structured semantic information of each target in the video frames, and the structured semantic information comprises time information, target behavior information and behavior positions of the targets which are arranged according to a preset sequence;
dividing each video frame into corresponding time groups according to the time information corresponding to each video frame;
aggregating the structured semantic information corresponding to each video frame in the time group to obtain the intra-group structured semantic information aiming at the time group; if only one target exists in each video frame of the time group, taking the unique structured semantic information in each set of information as the structured semantic information aiming at the target;
if each video frame of the time group comprises J targets, wherein J is greater than or equal to 2, then:
for any two adjacent video frames, taking the arrangement order of the structured semantic information in the previous-frame set information as a standard order;
sorting the structured semantic information in the following-frame set information to obtain J! candidate orderings;
calculating, according to the behavior positions in the structured semantic information of the previous frame and the following frame, the distances between the behavior positions having the same sequence numbers in each candidate ordering and in the standard order;
calculating, for each candidate ordering, the sum of the distances between its behavior positions and the behavior positions with corresponding sequence numbers in the standard order;
selecting the candidate ordering with the smallest distance sum as the standard order of the following-frame set information, wherein the jth structured semantic information in the following-frame set information and the jth structured semantic information in the previous-frame set information are for the same target, and j is greater than or equal to 1 and less than or equal to J.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710551156.2A CN109218660B (en) | 2017-07-07 | 2017-07-07 | Video processing method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109218660A CN109218660A (en) | 2019-01-15 |
| CN109218660B true CN109218660B (en) | 2021-10-12 |
Family
ID=64990903
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710551156.2A Active CN109218660B (en) | 2017-07-07 | 2017-07-07 | Video processing method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109218660B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110428522B (en) * | 2019-07-24 | 2021-04-30 | 青岛联合创智科技有限公司 | Intelligent security system of wisdom new town |
| CN110874424A (en) * | 2019-09-23 | 2020-03-10 | 北京旷视科技有限公司 | Data processing method and device, computer equipment and readable storage medium |
| CN110677722A (en) * | 2019-09-29 | 2020-01-10 | 上海依图网络科技有限公司 | Video processing method, and apparatus, medium, and system thereof |
| CN113190710B (en) * | 2021-04-27 | 2023-05-02 | 南昌虚拟现实研究院股份有限公司 | Semantic video image generation method, semantic video image playing method and related devices |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101902617A (en) * | 2010-06-11 | 2010-12-01 | 公安部第三研究所 | A Device and Method for Realizing Video Structured Description Using DSP and FPGA |
| CN102819528A (en) * | 2011-06-10 | 2012-12-12 | 中国电信股份有限公司 | Method and device for generating video abstraction |
| US9271035B2 (en) * | 2011-04-12 | 2016-02-23 | Microsoft Technology Licensing, Llc | Detecting key roles and their relationships from video |
| CN106210612A (en) * | 2015-04-30 | 2016-12-07 | 杭州海康威视数字技术股份有限公司 | Method for video coding, coding/decoding method and device thereof |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |