CN101651772A - Method for extracting video interested region based on visual attention - Google Patents
- Publication number
- CN101651772A, CN200910152520A
- Authority
- CN
- China
- Prior art keywords
- video frame
- depth
- pixel
- current
- visual attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for extracting video regions of interest based on visual attention. The region of interest extracted by the method fuses static-image-domain visual attention, motion visual attention and depth visual attention, which effectively suppresses the inherent one-sidedness and inaccuracy of each individual visual-attention cue. It resolves the noise caused by complex backgrounds in static-image-domain visual attention and the inability of motion visual attention to capture regions of interest with only local motion or small motion amplitude, thereby improving computational accuracy and the stability of the algorithm, and allowing regions of interest to be extracted from backgrounds with complex texture and from moving environments. In addition, the region of interest obtained by the method not only matches the visual interest characteristics of the human eye for static texture video frames and for moving objects, but also matches the depth-perception characteristic of stereoscopic vision that objects with a strong sense of depth or a short viewing distance attract interest, and therefore conforms to the semantic features of human stereoscopic vision.
Description
Technical Field
The present invention relates to a method for processing video signals, and in particular to a method for extracting video regions of interest based on visual attention.
Background Art
Stereoscopic television, also known as 3DTV (Three-Dimensional Television), offers the leap from flat to stereoscopic viewing and gives viewers a distinctive sense of depth and realism, and has therefore attracted great attention from research institutions and industry at home and abroad. In 2002 the ATTEST (Advanced Three-Dimensional Television System Technologies) project was launched within the IST programme supported by the European Commission, with the goal of establishing a complete, backward-compatible three-dimensional digital television broadcast chain. The ATTEST project proposed a new concept for the 3DTV broadcast chain that remains backward compatible with existing two-dimensional broadcasting and broadly supports many forms of two-dimensional and three-dimensional display. Its central design idea is to add a depth map as enhancement-layer information on top of conventional two-dimensional video transmission, i.e. the "two-dimensional colour video plus depth" data representation: the display terminal decodes and reconstructs three-dimensional video from the two-dimensional colour video plus depth, and some advanced autostereoscopic (glasses-free) display terminals in the industry already support this display mode.
In the human visual reception and processing system, because brain resources are limited and external information differs in importance, the human brain does not treat all external information equally but exhibits selectivity, that is, different degrees of interest. The extraction of video regions of interest has long been one of the core and most difficult techniques of content-based video processing in fields such as video compression and communication, video retrieval and pattern recognition. Studies in visual psychology show that this selectivity of the human eye towards external visual input is inseparably linked to the characteristics of human visual attention. Research on visual attention cues is currently divided into two main directions: top-down (also called concept-driven) attention cues and bottom-up (also called stimulus-driven) attention cues. Top-down cues arise mainly from complex psychological processes and attend directly to particular objects in the scene, including object shape, motion, pattern and other identifying features; they are affected by personal knowledge, interests, the subconscious and other factors and therefore vary from person to person. Bottom-up cues arise mainly from the direct stimulation of the visual cortex by the visual features of the video scene, chiefly colour, brightness and orientation. Bottom-up attention is instinctive and automatic, has good general applicability, is relatively stable, and is largely unaffected by conscious factors such as personal knowledge and preferences, so bottom-up attention cues are one of the hot topics in research on automatic region-of-interest extraction.
However, current automatic region-of-interest extraction methods fall mainly into three categories. 1) Methods that use the internal information of a single-view image, including brightness, colour, texture or orientation stimuli, to extract the regions of the current video frame that interest the human eye; these methods mainly take regions with large contrast in brightness, colour and texture as regions of interest, which makes them hard to apply against complex backgrounds. 2) Methods based on the visual principle that the human eye is interested in moving regions, which use inter-frame motion information as the main cue to extract regions of interest; such methods have difficulty extracting slowly moving or locally moving objects accurately and are also hard to apply under global motion. 3) Methods that combine static texture and motion information; because the redundancy and correlation between static texture and motion information are weak, such methods cannot effectively suppress the extraction errors and noise of either cue, so the extraction accuracy is low. Because of the limited amount of information they can exploit, these three classes of traditional methods yield regions of interest that are insufficiently accurate and lack stability. Moreover, traditional methods do not consider the stereoscopic characteristic that the human eye is interested in objects with a strong sense of depth or close to the viewer, and thus cannot represent the true degree of interest of the human eye under stereoscopic vision; they are therefore hard to apply to region-of-interest extraction conforming to stereoscopic semantic features in new-generation stereoscopic (three-dimensional)/multi-view video.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a visual-attention-based method for extracting video regions of interest that yields regions of interest with high accuracy and good stability and that conform to the semantic features of human stereoscopic vision.
The technical solution adopted by the present invention to solve the above technical problem is a method for extracting video regions of interest based on visual attention, comprising the following steps:
① Define the two-dimensional colour video as the texture video, and define the size of the texture video frame at each moment of the texture video as W×H, where W is the width and H the height of the texture video frames. Denote the texture video frame at moment t of the texture video as F_t and define it as the current texture video frame. Detect the static-image-domain visual attention of the current texture video frame with a known static-image visual attention detection method, obtaining the static-image-domain visual attention distribution map of the current texture video frame, denoted S_I; S_I has size W×H and is a greyscale image represented with a bit depth of Z_S.
② Detect the motion visual attention of the current texture video frame with a motion visual attention detection method, obtaining the motion visual attention distribution map of the current texture video frame, denoted S_M; S_M has size W×H and is a greyscale image represented with a bit depth of Z_S.
③ Define the depth video frame at each moment of the depth video corresponding to the texture video as a greyscale image represented with a bit depth of Z_D, and set the size of the depth video frame at each moment of the depth video to W×H, where W is the width and H the height of the depth video frames. Denote the depth video frame at moment t of the depth video as D_t and define it as the current depth video frame. Detect the depth visual attention of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame with a depth visual attention detection method, obtaining the depth visual attention distribution map of the three-dimensional video image, denoted S_D; S_D has size W×H and is a greyscale image represented with a bit depth of Z_S.
④ Fuse, with a depth-perception-based visual attention fusion method, the static-image-domain visual attention distribution map S_I of the current texture video frame, the motion visual attention distribution map S_M of the current texture video frame, the current depth video frame, and the depth visual attention distribution map S_D of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame, so as to extract a three-dimensional visual attention distribution map conforming to human stereoscopic perception, denoted S; S has size W×H and is a greyscale image represented with a bit depth of Z_S.
⑤ Apply thresholding and macroblock post-processing to the three-dimensional visual attention distribution map S to obtain the final region of interest of the current texture video frame that conforms to human stereoscopic perception.
⑥ Repeat steps ① to ⑤ until all texture video frames of the texture video have been processed, obtaining the video regions of interest of the texture video.
The specific process of the motion visual attention detection method in step ② is as follows:
②-1. Denote the texture video frame of the texture video at moment t+j, temporally adjacent to the current texture video frame, as F_{t+j}, and the texture video frame at moment t-j as F_{t-j}, where j ∈ (0, N_F/2] and N_F is a positive integer smaller than 10.
②-2. Compute, with the known optical-flow method, the horizontal and vertical motion-vector images between the current texture video frame and the texture video frame F_{t+j} at moment t+j, and between the current texture video frame and the texture video frame F_{t-j} at moment t-j. Denote the horizontal and vertical motion-vector images between the current texture video frame and F_{t+j} as V_{t+j}^H and V_{t+j}^V, and those between the current texture video frame and F_{t-j} as V_{t-j}^H and V_{t-j}^V; V_{t+j}^H, V_{t+j}^V, V_{t-j}^H and V_{t-j}^V have width W and height H.
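The claim only refers to "the known optical-flow method" without naming a specific algorithm. Below is a minimal sketch of step ②-2, assuming OpenCV's Farneback dense optical flow as that method; the algorithm choice and its parameters are assumptions, not part of the patent.

```python
# Sketch of step 2-2: dense optical flow between the current texture frame and a
# neighbouring frame, assuming the Farneback method (an assumption; the patent only
# says "the known optical-flow method"). Inputs are single-channel grayscale frames.
import cv2
import numpy as np

def motion_vector_images(frame_t, frame_other):
    """Return the horizontal (V^H) and vertical (V^V) motion-vector images, size W x H."""
    # positional args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    #                  poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(frame_t, frame_other, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    v_h, v_v = flow[..., 0], flow[..., 1]   # horizontal and vertical components
    return v_h, v_v

# Usage (frame_t, frame_tpj, frame_tmj are F_t, F_{t+j}, F_{t-j}):
# v_h_fwd, v_v_fwd = motion_vector_images(frame_t, frame_tpj)
# v_h_bwd, v_v_bwd = motion_vector_images(frame_t, frame_tmj)
```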
②-3. Superpose the absolute value of V_{t+j}^H and the absolute value of V_{t+j}^V to obtain the motion-magnitude image of the current texture video frame with respect to the texture video frame F_{t+j} at moment t+j, denoted M_{t+j}, whose pixel value at coordinate (x, y) is m_{t+j}(x, y) = |v_{t+j}^H(x, y)| + |v_{t+j}^V(x, y)|; the motion-magnitude image M_{t-j} of the current texture video frame with respect to F_{t-j} is obtained in the same way.
②-4. Using the current texture video frame, the texture video frame F_{t+j} at moment t+j and the texture video frame F_{t-j} at moment t-j, extract the joint motion map, denoted M_j^Δ. The specific process is: for each pixel, judge whether the minimum of the motion-magnitude values at the corresponding coordinates of M_{t+j} and M_{t-j} is greater than the set first threshold T_1; if it is, the pixel value at the corresponding coordinate of M_j^Δ is the average of the sum of the two motion-magnitude values, otherwise it is 0. That is, for the pixel at coordinate (x, y), judge whether min(m_{t+j}(x, y), m_{t-j}(x, y)) > T_1; if so, the pixel value at coordinate (x, y) of M_j^Δ is (m_{t+j}(x, y) + m_{t-j}(x, y))/2, otherwise it is 0, where min() is the minimum function.
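A short sketch of steps ②-3 and ②-4 as stated above: the motion-magnitude image m(x, y) = |v^H(x, y)| + |v^V(x, y)|, and the joint motion map that keeps only pixels whose magnitude exceeds T_1 in both temporal directions. The function names are illustrative only.

```python
# Sketch of steps 2-3 and 2-4: motion-magnitude images and the joint motion map M_j^delta.
# v_h_fwd/v_v_fwd and v_h_bwd/v_v_bwd are the motion-vector images towards F_{t+j} and F_{t-j}.
import numpy as np

def motion_magnitude(v_h, v_v):
    """m(x, y) = |v^H(x, y)| + |v^V(x, y)|"""
    return np.abs(v_h) + np.abs(v_v)

def joint_motion_map(m_fwd, m_bwd, t1=1.0):
    """Keep a pixel only if its magnitude exceeds T1 in both directions; value is the average."""
    keep = np.minimum(m_fwd, m_bwd) > t1
    return np.where(keep, (m_fwd + m_bwd) / 2.0, 0.0)
```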
②-5. Weighted superposition of the joint motion maps of the moments that are 1 to N_F/2 apart in time from moment t gives the weighted joint motion map of the current texture video frame, denoted M; the pixel value of M at coordinate (x, y) is m(x, y), obtained as a weighted sum of m_j^Δ(x, y) over j ∈ [1, N_F/2].
②-6. Apply Gaussian pyramid decomposition to the weighted joint motion map M of the current texture video frame, decomposing it into n_L layers of weighted joint motion maps; the i-th layer obtained from the decomposition of M is denoted M(i), whose width and height are W/2^i and H/2^i respectively, where n_L is a positive integer smaller than 20, i ∈ [0, n_L-1], W is the width of the current texture video frame and H its height.
②-7. Using the n_L layers of the weighted joint motion map M of the current texture video frame, extract the motion visual attention distribution map S_M of the current texture video frame; the pixel value of S_M at coordinate (x, y) is s_m(x, y), and S_M = F_M, where F_M is obtained by forming, for every pair of levels c ∈ [0, n_L-1] and s = c + δ with δ ∈ {-3, -2, -1, 1, 2, 3}, the cross-level absolute difference |M(c) ⊖ M(s)|, normalising each difference with the normalisation function N(·), and accumulating all normalised differences with the cross-level addition operator ⊕, followed by a final normalisation; the symbol "| |" denotes the absolute value, M(c) is the c-th layer weighted joint motion map and M(s) the s-th layer weighted joint motion map. The operator ⊖ is the cross-level difference between M(c) and M(s): if c < s, M(s) is upsampled to an image with the same resolution as M(c) and each pixel of M(c) is differenced with the corresponding pixel of the upsampled M(s); if c > s, M(c) is upsampled to the resolution of M(s) and each pixel of M(s) is differenced with the corresponding pixel of the upsampled M(c). The operator ⊕ is the cross-level addition of M(c) and M(s): if c < s, M(s) is upsampled to the resolution of M(c) and the corresponding pixels are summed; if c > s, M(c) is upsampled to the resolution of M(s) and the corresponding pixels are summed.
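A hedged sketch of steps ②-6 and ②-7 follows. The source formula for F_M is partly garbled, so the normalisation range of N(·) and the use of plain resizing for the cross-level difference and addition operators are assumptions.

```python
# Sketch of steps 2-6 and 2-7: Gaussian pyramid of the weighted joint motion map and a
# cross-level (centre-surround style) accumulation. The normalisation range and the
# resizing used for the cross-level operators are assumptions.
import cv2
import numpy as np

def gaussian_pyramid(img, n_levels):
    pyr = [img.astype(np.float32)]
    for _ in range(1, n_levels):
        pyr.append(cv2.pyrDown(pyr[-1]))      # each level halves width and height
    return pyr

def normalize(img, z_bits=8):
    """N(.): rescale to the greyscale range implied by the bit depth (assumed [0, 2^z - 1])."""
    lo, hi = float(img.min()), float(img.max())
    if hi - lo < 1e-12:
        return np.zeros_like(img)
    return (img - lo) / (hi - lo) * (2 ** z_bits - 1)

def motion_attention_map(weighted_joint_motion, n_levels=9, deltas=(-3, -2, -1, 1, 2, 3)):
    pyr = gaussian_pyramid(weighted_joint_motion, n_levels)
    h, w = pyr[0].shape
    acc = np.zeros_like(pyr[0])
    for c in range(n_levels):
        for d in deltas:
            s = c + d
            if not (0 <= s < n_levels):
                continue
            fine, coarse = (c, s) if c < s else (s, c)
            # cross-level difference: upsample the coarser level to the finer resolution
            up = cv2.resize(pyr[coarse], (pyr[fine].shape[1], pyr[fine].shape[0]))
            diff = normalize(np.abs(pyr[fine] - up))
            # cross-level addition: accumulate everything at the full resolution
            acc += cv2.resize(diff, (w, h))
    return normalize(acc)
```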
The first threshold set in step ②-4 is T_1 = 1.
The specific process of the depth visual attention detection method in step ③ is as follows:
③-1. Apply Gaussian pyramid decomposition to the current depth video frame, decomposing it into n_L layers of depth video frames; the i-th layer obtained from the decomposition is denoted D(i), whose width and height are W/2^i and H/2^i respectively, where n_L is a positive integer smaller than 20, i ∈ [0, n_L-1], W is the width of the current depth video frame and H its height.
③-2. Using the n_L layers of the current depth video frame, extract the depth feature map of the current depth video frame, denoted F_D; F_D is obtained by normalising, with the normalisation function N(·), the cross-level absolute differences |D(c) ⊖ D(s)| over all level pairs c ∈ [0, n_L-1], s = c + δ, δ ∈ {-3, -2, -1, 1, 2, 3}, and accumulating them with the cross-level addition operator ⊕; the symbol "| |" denotes the absolute value, D(c) is the c-th layer depth video frame and D(s) the s-th layer depth video frame. The operator ⊖ is the cross-level difference between D(c) and D(s): if c < s, D(s) is upsampled to the resolution of D(c) and each pixel of D(c) is differenced with the corresponding pixel of the upsampled D(s); if c > s, D(c) is upsampled to the resolution of D(s) and each pixel of D(s) is differenced with the corresponding pixel of the upsampled D(c). The operator ⊕ is the cross-level addition of D(c) and D(s), performed analogously by upsampling the coarser level and summing the corresponding pixels.
③-3. Convolve the current depth video frame with known Gabor filters of orientations 0, π/4, π/2 and 3π/4 to extract the four orientation components, obtaining the four orientation component maps of the current depth video frame, denoted O_0^D, O_{π/4}^D, O_{π/2}^D and O_{3π/4}^D. Apply Gaussian pyramid decomposition to each of these orientation component maps, decomposing each into n_L layers of orientation component maps; the i-th layer obtained from the decomposition of the orientation component map of orientation θ is denoted O_θ^D(i), whose width and height are W/2^i and H/2^i respectively, where θ ∈ {0, π/4, π/2, 3π/4}, i ∈ [0, n_L-1], W is the width of the current depth video frame and H its height.
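A sketch of the Gabor filtering in step ③-3. The patent does not specify the Gabor kernel size, bandwidth or wavelength, so the parameter values below are assumptions.

```python
# Sketch of step 3-3: Gabor filtering of the current depth video frame at the four
# orientations 0, pi/4, pi/2 and 3*pi/4. Kernel size, sigma, wavelength and aspect
# ratio are assumptions (not specified in the patent).
import cv2
import numpy as np

def depth_orientation_maps(depth_frame, ksize=9, sigma=2.0, lambd=4.0, gamma=0.5):
    maps = {}
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        # positional args: ksize, sigma, theta, lambd, gamma, psi
        kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, 0)
        # O_theta^D: convolution of the depth frame with the oriented Gabor kernel
        maps[theta] = cv2.filter2D(depth_frame.astype(np.float32), cv2.CV_32F, kernel)
    return maps
```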
③-4. Using the n_L layers of the orientation component maps of each orientation of the current depth video frame, extract the preliminary depth orientation feature map of the current depth video frame, denoted F′_DO, obtained by normalising and accumulating the cross-level absolute differences |O_θ^D(c) ⊖ O_θ^D(s)| over all orientations θ and over level pairs c, s defined as in step ③-2.
③-5. Using the known morphological dilation algorithm with a block of size w_1 × h_1 as the basic dilation element, apply n_1 dilation operations to the preliminary depth orientation feature map F′_DO of the current depth video frame, obtaining the depth orientation feature map of the current depth video frame, denoted F_DO.
③-6. Using the depth feature map F_D and the depth orientation feature map F_DO of the current depth video frame, obtain the preliminary depth visual attention distribution map of the current depth video frame, denoted S′_D; the pixel value of S′_D at coordinate (x, y) is s′_d(x, y), obtained by combining the normalised F_D and F_DO, where N(·) is the normalisation function that normalises values to a fixed interval.
③-7. Using the preliminary depth visual attention distribution map S′_D of the current depth video frame, obtain the depth visual attention distribution map S_D of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame; the pixel value of S_D at coordinate (x, y) is s_d(x, y) = s′_d(x, y)·g(x, y), where g(x, y) is a depth-dependent weighting function computed from the current depth video frame and the set second threshold b.
In step ③-5, w_1 = 8, h_1 = 8 and n_1 = 2; the second threshold b set in step ③-7 is 16.
The specific process of the depth-perception-based visual attention fusion method in step ④ is as follows:
④-1. Scale the current depth video frame by Q(d(x, y)) = d(x, y) + γ, where γ is a coefficient whose value lies within a set range, d(x, y) is the pixel value of the pixel at coordinate (x, y) of the current depth video frame, and Q(d(x, y)) is the pixel value of the pixel at coordinate (x, y) of the scaled current depth video frame.
④-2. Using the scaled current depth video frame, the current depth video frame, the depth visual attention distribution map S_D of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame, the motion visual attention distribution map S_M of the current texture video frame and the static-image-domain visual attention distribution map S_I of the current texture video frame, obtain the three-dimensional visual attention distribution map S; the pixel value of S at coordinate (x, y) is s(x, y), computed as a weighted combination of S_D, S_M and S_I with weighting coefficients K_D, K_M and K_I that satisfy the stated constraint.
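The exact fusion expression and the constraint on K_D, K_M and K_I cannot be recovered from the source, so the sketch below only illustrates one plausible reading of step ④: a linear weighting of S_I, S_M and S_D modulated by the scaled depth Q(d) = d + γ. All weight values and the form of the depth modulation are assumptions.

```python
# Hedged sketch of step 4: depth-perception-based fusion of the three attention maps.
# The convex weighting and the multiplicative depth modulation below are assumptions;
# the patent's exact fusion formula is not recoverable from the source text.
import numpy as np

def fuse_attention(s_i, s_m, s_d, depth_frame, k_i=0.3, k_m=0.4, k_d=0.3, gamma=1.0, z_bits=8):
    q = depth_frame.astype(np.float32) + gamma      # scaled depth Q(d(x, y)) = d(x, y) + gamma
    q = q / q.max()                                 # assumed: use Q as a relative weight in (0, 1]
    s = k_i * s_i + k_m * s_m + k_d * s_d           # assumed linear weighting, k_i + k_m + k_d = 1
    s = s * q                                       # assumed depth modulation (larger depth value = stronger)
    return (s - s.min()) / (s.max() - s.min() + 1e-12) * (2 ** z_bits - 1)
```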
The specific process of the thresholding and macroblock post-processing of the three-dimensional visual attention distribution map S in step ⑤ is as follows:
⑤-1. Denote the pixel value of the pixel at coordinate (x, y) of the three-dimensional visual attention distribution map S as s(x, y), define the third threshold T_S, and threshold S against T_S to obtain a preliminary binary mask image in which each pixel is marked either as a pixel of interest or as a non-interest pixel.
⑤-2. Divide the preliminary binary mask image into (W/w_2) × (H/h_2) non-overlapping blocks of size w_2 × h_2; the block with horizontal index u and vertical index v is denoted B_{u,v}, where u ∈ [0, W/w_2-1] and v ∈ [0, H/h_2-1]. According to each block of the preliminary binary mask image, determine whether the pixels of the corresponding block of the current texture video frame are pixels of interest or non-interest pixels: for block B_{u,v}, judge whether the number of pixels marked as pixels of interest in B_{u,v} is greater than the set fourth threshold T_b, where 0 ≤ T_b ≤ w_2 × h_2; if so, mark all pixels of the corresponding block of the current texture video frame as pixels of interest and take that block as a region-of-interest block, otherwise mark all its pixels as non-interest pixels and take it as a non-region-of-interest block. This yields the preliminary region-of-interest mask image of the current texture video frame, composed of region-of-interest blocks and non-region-of-interest blocks.
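A sketch of steps ⑤-1 and ⑤-2: thresholding S into a preliminary binary mask and classifying non-overlapping w_2×h_2 blocks by counting pixels of interest. The value used here for the third threshold T_S is an assumption.

```python
# Sketch of steps 5-1 and 5-2: threshold the 3-D attention map S into a preliminary
# binary mask, then mark each non-overlapping w2 x h2 block as a region-of-interest
# block when it contains more than T_b pixels of interest. T_S here is an assumption.
import numpy as np

def block_roi_mask(s, t_s=128, w2=16, h2=16, t_b=50):
    interest = s > t_s                              # preliminary binary mask: pixels of interest
    h, w = interest.shape
    roi_blocks = np.zeros((h // h2, w // w2), dtype=bool)
    for v in range(h // h2):
        for u in range(w // w2):
            block = interest[v * h2:(v + 1) * h2, u * w2:(u + 1) * w2]
            roi_blocks[v, u] = block.sum() > t_b    # block B_{u,v} is a region-of-interest block
    return roi_blocks
```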
⑤-3. Mark all pixels of the non-region-of-interest blocks most adjacent to the region-of-interest blocks in the preliminary region-of-interest mask image as the N_R-th level transition region of interest, and update the preliminary region-of-interest mask image; then mark all pixels of the non-region-of-interest blocks most adjacent to the N_R-th level transition region of interest in the updated mask image as the (N_R-1)-th level transition region of interest, and update the mask image recursively; repeat this process recursively until the level-1 transition region of interest has been marked. Finally, the final region-of-interest mask image of the current texture video frame is obtained, composed of region-of-interest blocks, N_R levels of transition regions of interest and non-region-of-interest blocks.
⑤-4. Denote the pixel value of the pixel at coordinate (x, y) of the final region-of-interest mask image as r(x, y). Set the pixel values of all pixels of the non-region-of-interest blocks of the final region-of-interest mask image to r(x, y) = 255, and set the pixel values of all pixels of the N_R levels of transition regions of interest to intermediate values graded according to their transition level.
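A hedged sketch of steps ⑤-3 and ⑤-4. The grey values assigned to the region-of-interest blocks and to the transition levels are not recoverable from the source (only non-interest = 255 is stated), so the grading below is an assumption; the level labelling itself follows the recursive block-level dilation described above.

```python
# Sketch of steps 5-3 and 5-4: label N_R levels of transition regions around the
# region-of-interest blocks by repeated block-level dilation, then build the final mask.
# Grey values (ROI = 0, non-ROI = 255, linearly graded transitions) are assumptions.
import numpy as np
from scipy.ndimage import binary_dilation

def final_roi_mask(roi_blocks, n_r=3, w2=16, h2=16):
    levels = np.full(roi_blocks.shape, -1, dtype=int)   # -1 = non-interest block
    levels[roi_blocks] = n_r + 1                         # region-of-interest blocks
    grown = roi_blocks.copy()
    for level in range(n_r, 0, -1):                      # levels N_R, N_R-1, ..., 1
        dilated = binary_dilation(grown)
        ring = dilated & ~grown                          # blocks most adjacent to the current region
        levels[ring] = level
        grown = dilated
    grey = np.where(levels == n_r + 1, 0,
           np.where(levels == -1, 255, (n_r + 1 - levels) * (255 // (n_r + 1))))
    return np.kron(grey, np.ones((h2, w2), dtype=int))   # expand block labels back to pixels
```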
In step ⑤-2, w_2 = 16, h_2 = 16, and the set fourth threshold is T_b = 50.
Compared with the prior art, the advantage of the present invention is that it jointly exploits temporally synchronised texture video frames and the depth video frames corresponding to them. First, the static-image-domain visual attention of the texture video frame is extracted to obtain its static-image-domain visual attention distribution map; motion visual attention is extracted from temporally consecutive texture video frames to obtain the motion visual attention distribution map of the texture video frame; and the depth visual attention of the depth video frame is extracted to obtain the depth visual attention distribution map of the three-dimensional video image jointly presented by the depth video frame and the texture video frame. Then, using the obtained static-image-domain, motion and depth visual attention distribution maps together with the depth information, a depth-perception-based fusion method yields a three-dimensional (stereoscopic) visual attention distribution map conforming to the characteristics of human stereoscopic vision, and thresholding and macroblock post-processing operations yield the final video region of interest conforming to human stereoscopic perception together with its mask image of region-of-interest and non-interest regions. The region of interest extracted by this method fuses static-image-domain visual attention, motion visual attention and depth visual attention, effectively suppressing the inherent one-sidedness and inaccuracy of each individual cue; it resolves the noise caused by complex backgrounds in static-image-domain visual attention and the inability of motion visual attention to extract regions of interest with only local motion or small motion amplitude, thereby improving computational accuracy, enhancing the stability of the algorithm, and enabling regions of interest to be extracted from backgrounds with complex texture and from moving environments. In addition, the region of interest obtained by this method not only matches the visual interest characteristics of the human eye for static texture video frames and for moving objects, but also matches the depth-perception characteristic of stereoscopic vision that objects with a strong sense of depth or a short viewing distance attract interest, and therefore conforms to the semantic features of human stereoscopic vision.
Brief Description of the Drawings
Fig. 1a is the colour video frame at moment t of the two-dimensional colour video of the test sequence "Ballet";
Fig. 1b is the colour video frame at moment t of the two-dimensional colour video of the test sequence "Door Flower";
Fig. 2a is the depth video frame at moment t of the depth video corresponding to the two-dimensional colour video of the test sequence "Ballet";
Fig. 2b is the depth video frame at moment t of the depth video corresponding to the two-dimensional colour video of the test sequence "Door Flower";
Fig. 3 is the overall flow diagram of the method of the present invention;
Fig. 4 is the flow diagram of detecting the static-image-domain visual attention of the current texture video frame with the known static-image visual attention detection method;
Fig. 5 is the flow diagram of the motion visual attention detection method;
Fig. 6 is the flow diagram of the depth visual attention detection method;
Fig. 7a is the luminance feature map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Ballet";
Fig. 7b is the chrominance feature map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Ballet";
Fig. 7c is the orientation feature map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Ballet";
Fig. 8a is the static-image-domain visual attention distribution map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Ballet";
Fig. 8b is the motion visual attention distribution map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Ballet";
Fig. 8c is the depth visual attention distribution map of the three-dimensional video image jointly presented by the colour video frame at moment t of the two-dimensional colour video of the test sequence "Ballet" and the corresponding depth video frame;
Fig. 9 is the three-dimensional visual attention distribution map obtained by processing the colour video frame at moment t of the two-dimensional colour video of the test sequence "Ballet" and the corresponding depth video frame with the present invention;
Fig. 10a is the final region-of-interest mask image extracted by the present invention from the texture video frame at moment t of the test sequence "Ballet";
Fig. 10b is the region of interest extracted by the present invention from the texture video frame at moment t of the test sequence "Ballet";
Fig. 11a is the region of interest extracted from the texture video frame at moment t of the test sequence "Ballet" by a traditional extraction method relying only on static-image-domain visual attention cues;
Fig. 11b is the region of interest extracted from the texture video frame at moment t of the test sequence "Ballet" by a traditional extraction method relying only on motion visual attention cues;
Fig. 11c is the region of interest extracted from the texture video frame at moment t of the test sequence "Ballet" by a traditional joint static-image-domain and motion visual attention extraction method;
Fig. 12a is the luminance feature map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Door Flower";
Fig. 12b is the chrominance feature map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Door Flower";
Fig. 12c is the orientation feature map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Door Flower";
Fig. 13a is the static-image-domain visual attention distribution map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Door Flower";
Fig. 13b is the motion visual attention distribution map of the colour video frame at moment t of the two-dimensional colour video of the test sequence "Door Flower";
Fig. 13c is the depth visual attention distribution map of the three-dimensional video image jointly presented by the colour video frame at moment t of the two-dimensional colour video of the test sequence "Door Flower" and the corresponding depth video frame;
Fig. 14 is the three-dimensional visual attention distribution map obtained by processing the colour video frame at moment t of the two-dimensional colour video of the test sequence "Door Flower" and the corresponding depth video frame with the present invention;
Fig. 15a is the final region-of-interest mask image extracted by the present invention from the texture video frame at moment t of the test sequence "Door Flower";
Fig. 15b is the region of interest extracted by the present invention from the texture video frame at moment t of the test sequence "Door Flower";
Fig. 16a is the region of interest extracted from the texture video frame at moment t of the test sequence "Door Flower" by a traditional extraction method relying only on static-image-domain visual attention cues;
Fig. 16b is the region of interest extracted from the texture video frame at moment t of the test sequence "Door Flower" by a traditional extraction method relying only on motion visual attention cues;
Fig. 16c is the region of interest extracted from the texture video frame at moment t of the test sequence "Door Flower" by the joint static-image-domain and motion visual attention extraction method.
Detailed Description of Embodiments
The present invention is described in further detail below with reference to the embodiments shown in the accompanying drawings.
The visual-attention-based method for extracting video regions of interest of the present invention mainly exploits, jointly, the information of temporally synchronised texture video and depth video to extract the video region of interest. In this embodiment the texture video is a two-dimensional colour video, and the two-dimensional colour videos of the test sequences "Ballet" and "Door Flower" are taken as examples. Fig. 1a shows the colour video frame at moment t of the "Ballet" colour video and Fig. 1b that of the "Door Flower" colour video; Fig. 2a shows the depth video frame at moment t of the depth video corresponding to the "Ballet" colour video and Fig. 2b that of the "Door Flower" colour video. The depth video frame at each moment of the depth video corresponding to the two-dimensional colour video is a greyscale image represented with a bit depth of Z_D, whose grey values represent the relative distance from the camera of the object represented by each pixel of the depth video frame. The size of the texture video frame at each moment of the texture video is defined as W×H; for the depth video frames of the corresponding depth video, if the size of the depth video frame differs from that of the texture video frame, existing scaling and interpolation methods are generally used to set the depth video frame to the same size as the texture video frame, i.e. also W×H, where W is the width of the texture or depth video frame at each moment and H its height. Setting the depth video frame to the same size as the texture video frame makes it more convenient to extract the video region of interest, as in the sketch below.
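A minimal sketch of the rescaling mentioned above, assuming bilinear interpolation (the text only says "existing scaling and interpolation methods"):

```python
# Sketch: if the depth video frame does not match the W x H texture frame, rescale it so
# that both frames can be processed pixel-aligned. Bilinear interpolation is an assumption.
import cv2

def align_depth_to_texture(depth_frame, texture_frame):
    h, w = texture_frame.shape[:2]
    if depth_frame.shape[:2] != (h, w):
        depth_frame = cv2.resize(depth_frame, (w, h), interpolation=cv2.INTER_LINEAR)
    return depth_frame
```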
The overall flow diagram of the method of the present invention is shown in Fig. 3, and the method specifically comprises the following steps:
① Define the two-dimensional colour video as the texture video, and define the size of the texture video frame at each moment of the texture video as W×H, where W is the width and H the height of the texture video frames. Denote the texture video frame at moment t of the texture video as F_t and define it as the current texture video frame. Detect the static-image-domain visual attention of the current texture video frame with a known static-image visual attention detection method, obtaining the static-image-domain visual attention distribution map of the current texture video frame, denoted S_I; S_I has size W×H and is a greyscale image represented with a bit depth of Z_S. In this greyscale image, a larger pixel value indicates a higher relative degree of attention of the human eye to the corresponding pixel of the current texture video frame, and a smaller pixel value indicates a lower relative degree of attention.
In this embodiment, the flow diagram of detecting the static-image-domain visual attention of the current texture video frame with the known static-image visual attention detection method is shown in Fig. 4; in Fig. 4 each rectangle denotes a data-processing operation and each rhombus denotes an image, with rhombi of different sizes denoting images of different resolutions that are the input and output data of the corresponding operations. The current texture video frame is an image in RGB format, each pixel of which is represented by the three colour channels R, G and B. First the colour channel components of each pixel of the current texture video frame are linearly transformed and decomposed into one luminance component map and two chrominance component maps, i.e. a red-green component map and a blue-yellow component map, denoted I, RG and BY respectively. The pixel value of the luminance component map I at coordinate (x, y) is I_{x,y} = (r_{x,y} + g_{x,y} + b_{x,y})/3, where r_{x,y}, g_{x,y} and b_{x,y} are the pixel values of the three RGB colour channels of the current texture video frame at coordinate (x, y); the pixel values of the two chrominance component maps RG and BY at coordinate (x, y) are computed from r_{x,y}, g_{x,y} and b_{x,y} as red-green and blue-yellow colour-opponency components.
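A sketch of this colour decomposition follows. The intensity formula comes from the text; the RG and BY formulas are lost in the source, so the standard Itti-Koch broadly tuned colour-opponency definitions are assumed here.

```python
# Sketch of the colour decomposition in the static-attention step. The intensity formula
# comes from the source; the RG and BY opponency formulas are assumed (Itti-Koch style),
# since they are missing from the source text.
import numpy as np

def colour_feature_maps(frame_rgb):
    r, g, b = [frame_rgb[..., k].astype(np.float32) for k in range(3)]
    intensity = (r + g + b) / 3.0                  # I_{x,y} = (r + g + b) / 3
    big_r = r - (g + b) / 2.0                      # assumed broadly tuned colour channels
    big_g = g - (r + b) / 2.0
    big_b = b - (r + g) / 2.0
    big_y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b
    rg = big_r - big_g                             # red-green opponency map
    by = big_b - big_y                             # blue-yellow opponency map
    return intensity, rg, by
```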
Each image of the test sequences "Ballet" and "Door Flower" has size 1024×768. The luminance, chrominance and orientation feature maps of the colour video frame at moment t of the "Ballet" colour video are shown in Figs. 7a, 7b and 7c respectively, and those of the "Door Flower" colour video in Figs. 12a, 12b and 12c. In this embodiment Z_S = 8, i.e. each pixel of the static-image-domain visual attention distribution map S_I is represented with an 8-bit depth. The static-image-domain visual attention distribution map of the colour video frame at moment t of the "Ballet" colour video is shown in Fig. 8a, and that of the "Door Flower" colour video in Fig. 13a. Other known visual attention detection methods may also be used as the static-image visual attention detection method.
② Detect the motion visual attention of the current texture video frame with the motion visual attention detection method, obtaining the motion visual attention distribution map of the current texture video frame, denoted S_M; S_M has size W×H and is a greyscale image represented with a bit depth of Z_S. In this greyscale image, a larger pixel value indicates a higher relative degree of motion attention of the human eye to the corresponding pixel of the current texture video frame, and a smaller pixel value indicates a lower relative degree of motion attention.
In this embodiment, the flow diagram of the motion visual attention detection method is shown in Fig. 5, and its specific process is:
②-1. Denote the texture video frame of the texture video at moment t+j, temporally adjacent to the current texture video frame, as F_{t+j}, and the texture video frame at moment t-j as F_{t-j}, where j ∈ (0, N_F/2] and N_F is a positive integer smaller than 10; in the specific application of this embodiment N_F = 4, i.e. the motion region of the texture video is extracted jointly from the current texture video frame and its two preceding and two following frames.
②-2. Compute, with the known optical-flow method, the horizontal and vertical motion-vector images between the current texture video frame and the texture video frame F_{t+j} at moment t+j, and between the current texture video frame and the texture video frame F_{t-j} at moment t-j, denoted V_{t+j}^H, V_{t+j}^V, V_{t-j}^H and V_{t-j}^V as in step ②-2 above; their width is W and their height is H.
②-3、将Vt+j H的绝对值与Vt+j V的绝对值叠加得到当前纹理视频帧与t+j时刻的纹理视频帧Ft+j的运动幅度图像,记为Mt+j,
②-4、利用当前纹理视频帧和t+j时刻的纹理视频帧Ft+j及t-j时刻的纹理视频帧Ft-j,提取联合运动图,记为Mj Δ,提取联合运动图Mj Δ的具体过程为:判断当前纹理视频帧与t+j时刻的纹理视频帧Ft+j的运动幅度图像Mt+j中的各个像素和当前纹理视频帧与t-j时刻的纹理视频帧Ft-j的运动幅度图像Mt-j中对应坐标的像素的运动幅度值中的最小值是否大于设定的第一阈值T1,如果是,则确定联合运动图Mj Δ中相应坐标的像素的像素值为Mt+j和Mt-j中对应坐标的像素的运动幅度值之和的平均,否则,确定联合运动图Mj Δ中相应坐标的像素的像素值为0;对于Mt+j中坐标为(x,y)的像素和Mt-j中坐标为(x,y)的像素,判断min(mt+j(x,y),mt-j(x,y))是否大于设定的第一阈值T1,如果是,则确定联合运动图Mj Δ中坐标为(x,y)的像素的像素值为否则,确定联合运动图Mj Δ中坐标为(x,y)的像素的像素值为0,其中,min()为取最小值函数。在此,第一阈值T1=1,以滤除非常微小的相机参数抖动所造成的小噪声点。②-4. Use the current texture video frame and the texture video frame F t+j at time t+j and the texture video frame F tj at time tj to extract the joint motion map, denoted as M j Δ , and extract the joint motion map M j Δ The specific process is: judging the current texture video frame and the texture video frame F t+j at the time t+j of each pixel in the motion range image M t+j and the current texture video frame and the texture video frame F tj at the time tj Whether the minimum value of the motion amplitude value of the pixel corresponding to the coordinate in the motion amplitude image M tj is greater than the set first threshold T 1 , if yes, then determine the pixel value M of the pixel corresponding to the coordinate in the joint motion map M j Δ The average of the sum of the motion amplitude values of the pixels corresponding to the coordinates in t+j and M tj , otherwise, the pixel value of the pixel corresponding to the coordinates in the joint motion map M j Δ is determined to be 0; for M t+j , the coordinates are (x , y) and the pixel whose coordinates are (x, y) in M tj , determine whether min(m t+j (x, y), m tj (x, y)) is greater than the set first threshold T 1 , if yes, determine the pixel value of the pixel with coordinates (x, y) in the joint motion map M j Δ Otherwise, it is determined that the pixel value of the pixel with coordinates (x, y) in the joint motion map M j Δ is 0, where min() is a minimum value function. Here, the first threshold T 1 =1 to filter out small noise points caused by very slight camera parameter shakes.
②-5. Weight and superimpose the joint motion maps for all time offsets from 1 to N_F/2 relative to time t to obtain the weighted joint motion map of the current texture video frame, denoted M; the pixel value of the pixel at coordinate (x, y) in M is denoted m(x, y), and is a weighted sum of the values m^Δ_j(x, y) over j ∈ [1, N_F/2].
In video, moving objects are the main regions of interest; however, the degree of attention they receive differs with the type of motion. Video motion is mainly divided into two cases. In the first case, shot with a static camera, the background is still and the moving object is the main object of interest. In the second case, shot with a moving camera, the background undergoes global motion while the moving object either stays relatively still with respect to the camera or moves inconsistently with the background; in this case the moving object is still the object of interest. From this analysis, the region of motion attention mainly arises where the motion properties of an object differ from those of the background environment, i.e. in regions of large motion contrast, so the following steps can be used to obtain motion visual attention.
②-6. Perform Gaussian pyramid decomposition on the weighted joint motion map M of the current texture video frame, decomposing it into n_L levels of weighted joint motion maps. Denote the i-th level obtained after the decomposition as M(i); the width and height of M(i) are W/2^i and H/2^i respectively, where n_L is a positive integer less than 20, i ∈ [0, n_L−1], level 0 is the bottom level and level n_L−1 is the top level, W is the width of the current texture video frame and H its height. In the specific application of this embodiment n_L = 9.
②-7. Using the n_L levels of the weighted joint motion map M of the current texture video frame, extract the distribution map of motion visual attention of the current texture video frame, denoted S_M; the pixel value of the pixel at coordinate (x, y) in S_M is s_m(x, y), and S_M = F_M, where F_M combines the pyramid levels across scales with s, c ∈ [0, n_L−1], s = c+δ, δ = {−3, −2, −1, 1, 2, 3}, N(·) is a normalization function that normalizes values to a fixed interval, and "| |" denotes the absolute-value operation; M(c) is the c-th level weighted joint motion map and M(s) the s-th level weighted joint motion map. The cross-level difference between M(c) and M(s) is defined as follows: if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and each pixel of M(c) is differenced with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to the resolution of M(s), and each pixel of M(s) is differenced with the corresponding pixel of the up-sampled M(c). The cross-level addition between M(c) and M(s) is defined analogously: if c < s, M(s) is up-sampled to the resolution of M(c) and the pixels of M(c) are summed with the corresponding up-sampled pixels of M(s); if c > s, M(c) is up-sampled to the resolution of M(s) and the pixels of M(s) are summed with the corresponding up-sampled pixels of M(c).
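A sketch of the pyramid decomposition and the cross-level difference used in steps ②-6/②-7 (and reused for the depth frame in step ③). The exact combination formula of the patent is not reproduced here, and normalization to [0, 255] is an assumption.

```python
# Gaussian pyramid and center-surround (cross-level difference) helpers.
import cv2
import numpy as np

def gaussian_pyramid(img, n_levels=9):
    pyr = [img.astype(np.float32)]
    for _ in range(1, n_levels):
        pyr.append(cv2.pyrDown(pyr[-1]))           # each level halves width and height
    return pyr

def cross_level_difference(pyr, c, s):
    """|M(c) - M(s)|, with the coarser map up-sampled to the finer map's resolution."""
    fine, coarse = (c, s) if c < s else (s, c)
    up = cv2.resize(pyr[coarse], (pyr[fine].shape[1], pyr[fine].shape[0]),
                    interpolation=cv2.INTER_LINEAR)
    return np.abs(pyr[fine] - up)

def normalize(img, z_bits=8):
    lo, hi = img.min(), img.max()
    scale = (2 ** z_bits - 1) / (hi - lo) if hi > lo else 0.0
    return (img - lo) * scale
```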
The distribution map of motion visual attention obtained by applying this step to the color video frame at time t of the two-dimensional color video of the test sequence "Ballet" is shown in Fig. 8b; that for the test sequence "Door Flower" is shown in Fig. 13b.
③ Define the depth video frame at each time instant of the depth video corresponding to the texture video as a grayscale image represented with a bit depth of Z_D; its grayscale values, from 0 up to the maximum value representable with Z_D bits, express the relative distance from the captured object represented by each pixel of the depth video frame to the shooting camera, with grayscale value 0 corresponding to the maximum depth and the maximum grayscale value corresponding to the minimum depth. Set the size of the depth video frame at each time instant of the depth video to W×H, where W is the width and H the height of the depth video frames. Denote the depth video frame at time t as D_t and define it as the current depth video frame. Use a depth visual attention detection method to detect the depth visual attention of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame, obtaining the distribution map of depth visual attention of the three-dimensional video image, denoted S_D. The size of S_D is W×H and it is a grayscale image represented with a bit depth of Z_S; the larger the pixel value of a pixel in this grayscale image, the higher the human eye's relative depth attention to the corresponding pixel of the current texture video frame, and the smaller the pixel value, the lower that attention. In this embodiment, each pixel of the depth video frame is represented with Z_D = 8 bits and each pixel of the visual attention distribution map with Z_S = 8 bits.
A unique sense of stereo is the main feature distinguishing stereoscopic video from traditional single-channel video. For the visual attention of stereoscopic video, the sense of depth influences the user's visual attention mainly in two ways: on the one hand, the user is generally more interested in scenery (or objects) close to the shooting camera array than in scenery (or objects) far from it; on the other hand, regions of depth discontinuity present the user with strong depth contrast. In this embodiment, the flow chart of the depth visual attention detection method is shown in Fig. 6, and its specific process is as follows:
③-1. Perform Gaussian pyramid decomposition on the current depth video frame, decomposing it into n_L levels of depth video frames. Denote the i-th level obtained after the decomposition as D(i); the width and height of D(i) are W/2^i and H/2^i respectively, where n_L is a positive integer less than 20 and i ∈ [0, n_L−1]. Level 0 is the bottom level with the largest resolution, with D(0) = D_t, and level n_L−1 is the top level with the smallest resolution; W is the width of the current depth video frame and H its height.
③-2. Using the n_L levels of the current depth video frame, extract the depth feature map of the current depth video frame, denoted F_D, where s, c ∈ [0, n_L−1], s = c+δ, δ = {−3, −2, −1, 1, 2, 3}, N(·) is a normalization function to a fixed interval, "| |" denotes the absolute-value operation, D(c) is the c-th level depth video frame and D(s) the s-th level depth video frame. The cross-level difference between D(c) and D(s) is defined as in step ②-7: if c < s, D(s) is up-sampled to an image with the same resolution as D(c) and each pixel of D(c) is differenced with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to the resolution of D(s) and each pixel of D(s) is differenced with the corresponding pixel of the up-sampled D(c). The cross-level addition is defined analogously, with the differences replaced by sums.
③-3. Depth edge regions with large depth differences give the user a stronger sense of depth, so the strong depth-edge regions of the current depth video frame are another important region of interest for depth visual attention. Therefore, well-known Gabor filters in the 0, π/4, π/2 and 3π/4 directions are convolved with the current depth video frame to extract its four direction components, giving four direction component maps of the current depth video frame, denoted O^D_0, O^D_{π/4}, O^D_{π/2} and O^D_{3π/4}. Perform Gaussian pyramid decomposition on each of the four direction component maps, decomposing each into n_L levels; denote the i-th level of the direction component map for direction θ as O^D_θ(i), whose width and height are W/2^i and H/2^i respectively, where θ ∈ {0, π/4, π/2, 3π/4}, i ∈ [0, n_L−1], level 0 is the bottom level and level n_L−1 is the top level.
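A sketch of the four-orientation Gabor filtering of the depth frame. The patent only fixes the four orientations; the kernel size and the remaining Gabor parameters below are assumptions for illustration.

```python
# Sketch of step ③-3: direction component maps O^D_theta of the current depth frame.
import cv2
import numpy as np

def direction_components(depth_frame):
    depth = depth_frame.astype(np.float32)
    components = {}
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        kernel = cv2.getGaborKernel(ksize=(15, 15), sigma=3.0, theta=theta,
                                    lambd=8.0, gamma=0.5, psi=0.0)
        components[theta] = cv2.filter2D(depth, -1, kernel)   # O^D_theta
    return components
```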
③-4. Using the n_L levels of the direction component maps of the current depth video frame in each direction, extract the preliminary depth orientation feature map of the current depth video frame, denoted F′_DO.
③-5. Use the well-known morphological dilation algorithm, with a block of size w_1×h_1 as the basic dilation unit, to perform n_1 dilation operations on the preliminary depth orientation feature map F′_DO of the current depth video frame, obtaining the depth orientation feature map of the current depth video frame, denoted F_DO. In this embodiment, for the "Ballet" and "Door Flower" test sequences, the size of each image in the test sequences is 1024×768; the basic unit of morphological dilation is an 8×8 block, i.e. w_1×h_1 = 8×8, and the number of dilations is n_1 = 2.
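A minimal sketch of this step with the embodiment's values (8×8 structuring element, two dilations); the function name is illustrative only.

```python
# Sketch of step ③-5: morphological dilation of F'_DO.
import cv2
import numpy as np

def depth_orientation_feature(F_do_prelim, block=(8, 8), iterations=2):
    kernel = np.ones(block, dtype=np.uint8)        # w1 x h1 basic dilation unit
    return cv2.dilate(F_do_prelim.astype(np.float32), kernel, iterations=iterations)
```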
③-6. Using the depth feature map F_D and the depth orientation feature map F_DO of the current depth video frame, obtain the distribution map of preliminary depth visual attention of the current depth video frame, denoted S′_D; the pixel value of the pixel at coordinate (x, y) in S′_D is s′_d(x, y), where N(·) is a normalization function to a fixed interval.
③-7. As for the left and right border regions of an image, the left image border of the left-view image has no corresponding region in the right view, so no stereo perception can form in the human brain; likewise, the right image border of the right-view image can hardly form stereo perception. Therefore, in stereoscopic video the left and right border regions of the image provide weak or even no stereo perception and are non-stereoscopic visual attention regions, so the present invention suppresses the border regions of the distribution map S′_D of preliminary depth visual attention of the current depth video frame. Using S′_D, obtain the distribution map S_D of depth visual attention of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame: the pixel value of the pixel at coordinate (x, y) in S_D is s_d(x, y) = s′_d(x, y)·g(x, y), where g(x, y) is the border suppression weighting function.
Fig. 8c shows the distribution map of depth visual attention of the three-dimensional video image jointly presented by the color video frame at time t of the test sequence "Ballet" two-dimensional color video and the corresponding depth video frame; Fig. 13c shows that of the test sequence "Door Flower".
④ Use a depth-perception-based visual attention fusion method to fuse the distribution map S_I of static-image-domain visual attention of the current texture video frame, the distribution map S_M of motion visual attention of the current texture video frame, the current depth video frame, and the distribution map S_D of depth visual attention of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame, so as to extract the distribution map of three-dimensional visual attention that conforms to human stereoscopic perception, denoted S. The size of S is W×H and it is a grayscale image represented with a bit depth of Z_S; the larger the pixel value of a pixel in this grayscale image, the higher the human eye's relative attention to the corresponding pixel of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame, and the smaller the pixel value, the lower that attention.
In traditional single-channel video, moving objects attract the viewer's attention more easily than static objects; among objects that are all static, regions with bright colors, regions of large color or luminance contrast, and regions with large differences in texture orientation attract the viewer's attention more easily. In stereoscopic video, besides motion visual attention and static-image-domain visual attention, the distribution of human visual attention is also influenced by the unique sense of stereo that stereoscopic video provides to the user. This sense of stereo mainly comes from the slight positional offset between the scenes seen by the left and right eyes, called parallax: for example, our eyes are about 6 cm apart, and the object images received by each eye and projected onto the retina show a slight positional offset, which the brain automatically fuses into a stereoscopic image with depth, forming stereoscopic vision. The relative distance information of objects conveyed by the sense of stereo is another important factor that directly affects attention selection. In stereoscopic video, objects contained in regions of depth discontinuity or large depth contrast give the user a stronger depth difference and thus a stronger sense of stereo or depth, and belong to the regions of interest of the user; on the other hand, the viewer is more interested in foreground regions close to the shooting camera (or video viewer) than in regions far from it, so foreground regions are usually important potential regions of interest for stereoscopic video viewers. Based on this analysis, the factors influencing human three-dimensional visual attention include static-image-domain visual attention, motion visual attention, depth visual attention and depth. Therefore, in this embodiment the specific process of the depth-perception-based visual attention fusion method is as follows:
④-1. Apply a scale transformation to the current depth video frame via Q(d(x, y)) = d(x, y) + γ, where γ is a coefficient within a given range, d(x, y) is the pixel value of the pixel at coordinate (x, y) in the current depth video frame, and Q(d(x, y)) is the pixel value of the pixel at coordinate (x, y) in the scale-transformed current depth video frame.
④-2. Using the scale-transformed current depth video frame, the current depth video frame, the distribution map S_D of depth visual attention of the three-dimensional video image jointly presented by the current depth video frame and the current texture video frame, the distribution map S_M of motion visual attention of the current texture video frame, and the distribution map S_I of static-image-domain visual attention of the current texture video frame, obtain the distribution map S of three-dimensional visual attention; the pixel value of the pixel at coordinate (x, y) in S is s(x, y), where K_D, K_M and K_I are the weighting coefficients of S_D, S_M and S_I respectively and satisfy the stated weighting constraint.
Motion visual attention, static-image-domain visual attention and depth visual attention all play important roles in human visual attention; however, motion visual attention is the most important component of video visual attention, followed by the static-image-domain visual attention caused by luminance, color and orientation in the image domain, and then by depth visual attention. Therefore, in this embodiment each visual attention distribution map is represented with a bit depth of Z_S = 8, and K_D = 0.15, K_M = 0.4 and K_I = 0.35 are taken. The correlation between depth visual attention and motion visual attention is small, the correlation between depth visual attention and static-image-domain visual attention is also small, and the correlation between static-image-domain visual attention and motion visual attention is large, so the correlation coefficients C_DM, C_DI and C_IM are set to 0.2, 0.2 and 0.6 respectively. The scale transformation coefficient γ characterizes the scene depth of the texture video scene: the smaller γ is, the larger the scene depth and the stronger the sense of depth given to the viewer; conversely, the larger γ is, the smaller the scene depth and the weaker the sense of depth given to the viewer. For the "Ballet" and "Door Flower" test sequences, since the scene depth of field is small, the scale transformation coefficient γ is set to 50. The distribution map of three-dimensional visual attention extracted from the texture video frame at time t of the "Ballet" test sequence and the corresponding depth video frame is shown in Fig. 9; that for the "Door Flower" test sequence is shown in Fig. 14.
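The exact fusion formula is not reproduced in the text above. Purely as an illustration of how such a depth-perception-based fusion could be wired up, the sketch below uses a simple linear combination of the three attention maps weighted by K_D, K_M and K_I and modulated by the scale-transformed depth Q(d) = d + γ; this is an assumption, not the patent's formula, and the correlation coefficients C_DM, C_DI and C_IM are not used here.

```python
# Illustrative sketch only: linear fusion of S_I, S_M and S_D modulated by the
# scale-transformed depth, with the embodiment's weights K_I=0.35, K_M=0.4, K_D=0.15
# and gamma=50. NOT the patent's exact fusion formula.
import numpy as np

def fuse_attention(S_I, S_M, S_D, depth, K_I=0.35, K_M=0.4, K_D=0.15, gamma=50.0):
    Q = depth.astype(np.float32) + gamma           # scale-transformed depth Q(d)
    Q = Q / Q.max()                                # larger = closer to the camera
    S = K_I * S_I + K_M * S_M + K_D * S_D
    S = S * Q                                      # emphasize near (foreground) objects
    return np.clip(S, 0, 255).astype(np.uint8)
```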
⑤ Perform thresholding and macroblock post-processing on the distribution map S of three-dimensional visual attention to obtain the final region of interest of the current texture video frame that conforms to human stereoscopic perception.
In this embodiment, the specific process of the thresholding and macroblock post-processing of the distribution map S of three-dimensional visual attention is as follows:
⑤-1. Denote the pixel value of the pixel at coordinate (x, y) in the distribution map S of three-dimensional visual attention as s(x, y) and define a third threshold T_S; by comparing each s(x, y) with T_S, pixels whose value exceeds T_S are marked as pixels of interest and the remaining pixels as non-interest pixels, yielding the preliminary binary mask image.
⑤-2. Divide the preliminary binary mask image into (W/w_2)×(H/h_2) non-overlapping blocks of size w_2×h_2, and denote the block with horizontal index u and vertical index v as B_{u,v}, where u ∈ [0, W/w_2−1] and v ∈ [0, H/h_2−1]. According to each block of the preliminary binary mask image, determine whether the pixels in the corresponding block of the current texture video frame are pixels of interest or non-interest pixels: for block B_{u,v}, judge whether the number of pixels marked as pixels of interest in B_{u,v} is greater than a set fourth threshold T_b, where 0 ≤ T_b ≤ w_2×h_2; if so, mark all pixels in the block of the current texture video frame corresponding to B_{u,v} as pixels of interest and take that block as a region-of-interest block, otherwise mark all its pixels as non-interest pixels and take it as a non-region-of-interest block. This yields the preliminary region-of-interest mask image of the current texture video frame, which consists of region-of-interest blocks and non-region-of-interest blocks.
In this embodiment, the size of each image in the test sequences "Ballet" and "Door Flower" is 1024×768, so the size w_2×h_2 of block B_{u,v} can be set to 16×16. Regions containing only a few pixels of interest are unlikely to attract the viewer's interest, so the fourth threshold T_b is set to 50 here.
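A sketch of steps ⑤-1/⑤-2 combined. The exact definition of T_S is not reproduced above, so using the mean of S is an assumption; block size 16×16 and T_b = 50 follow the embodiment.

```python
# Sketch of thresholding the 3D attention map and classifying 16x16 macroblocks.
import numpy as np

def roi_blocks(S, block=16, T_b=50):
    T_S = S.mean()                                 # assumed choice of the third threshold
    mask = S > T_S                                 # preliminary binary mask (pixels of interest)
    H, W = mask.shape
    roi = np.zeros((H // block, W // block), dtype=bool)
    for v in range(H // block):
        for u in range(W // block):
            blk = mask[v * block:(v + 1) * block, u * block:(u + 1) * block]
            roi[v, u] = blk.sum() > T_b            # block B_{u,v} becomes a ROI block
    return roi
```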
⑤-3. Since the change between the region of interest and the non-region of interest is usually not abrupt but gradual, with a transition zone, the present invention sets N_R levels of transition region of interest between them (a sketch follows below). Mark all pixels in the non-region-of-interest blocks of the preliminary region-of-interest mask image that are most adjacent to region-of-interest blocks as the N_R-th level transition region of interest, and update the preliminary region-of-interest mask image; then mark all pixels in the non-region-of-interest blocks of the updated preliminary region-of-interest mask image that are most adjacent to the N_R-th level transition region of interest as the (N_R−1)-th level transition region of interest, recursively updating the preliminary region-of-interest mask image; repeat this recursion until the first-level transition region of interest has been marked. Finally, the final region-of-interest mask image of the current texture video frame is obtained, consisting of region-of-interest blocks, N_R levels of transition regions of interest and non-region-of-interest blocks. In this embodiment N_R = 2, i.e. two levels of transition region of interest are set.
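A sketch of the recursive growth of the N_R = 2 transition levels around the ROI blocks; block-level 4-neighbourhood adjacency is an assumption.

```python
# Sketch of step ⑤-3: grow transition ROI levels around the ROI block map.
import numpy as np

def transition_levels(roi, n_levels=2):
    """roi: boolean block map from roi_blocks(); returns an int map per block:
    0 = non-ROI, 1..n_levels = transition level, n_levels + 1 = ROI."""
    labels = np.where(roi, n_levels + 1, 0)
    current = roi.copy()
    for level in range(n_levels, 0, -1):
        ring = np.zeros_like(current)
        ring[:-1, :] |= current[1:, :]             # blocks adjacent to the current region
        ring[1:, :] |= current[:-1, :]
        ring[:, :-1] |= current[:, 1:]
        ring[:, 1:] |= current[:, :-1]
        ring &= labels == 0                        # only still-unmarked non-ROI blocks
        labels[ring] = level
        current |= ring
    return labels
```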
Fig. 10a shows the final region-of-interest mask image of the texture video frame at time t of the test sequence "Ballet"; Fig. 15a shows that of the test sequence "Door Flower". In Figs. 10a and 15a the black regions are regions of interest, the gray regions are transition regions of interest and the white regions are non-regions of interest.
⑤-4. Denote the pixel value of the pixel at coordinate (x, y) in the final region-of-interest mask image as r(x, y). Set the pixel values of all pixels in the non-region-of-interest blocks of the final region-of-interest mask image to r(x, y) = 255, and set the pixel values of all pixels in the N_R levels of transition regions of interest of the final region-of-interest mask image to values determined by their transition level.
Fig. 10b shows the region of interest of the texture video frame at time t of the test sequence "Ballet"; Fig. 15b shows that of the test sequence "Door Flower". The regions of interest in Figs. 10b and 15b have the same pixel values as the texture video frame at time t and display the colored texture content; the transition regions of interest are shown as dark gray regions of reduced brightness, and the flat white regions are the non-regions of interest corresponding to the white regions of the region-of-interest mask image. For comparison of extraction effects, Figs. 11a and 16a show the regions of interest of the texture video frames at time t of the test sequences "Ballet" and "Door Flower" extracted conventionally using only static-image-domain visual attention cues, which fail to remove noise regions with rich background texture. Figs. 11b and 16b show those extracted conventionally using only motion visual attention cues: for the "Ballet" sequence, the extraction method based only on motion visual attention cues cannot completely extract the man who moves very slowly, and the background noise caused by moving shadows is severe; for the "Door Flower" sequence, the method extracts only the motion regions and takes no account of texture complexity or the sense of depth provided by stereoscopic vision. Figs. 11c and 16c show the regions of interest of the texture video frames at time t of the test sequences "Ballet" and "Door Flower" extracted by combining static-image-domain visual attention and motion visual attention cues; although this method combines static and motion visual information, the texture regions and motion noise in the background environment are not effectively suppressed.
From the comparison experiments between Figs. 10a and 10b and Figs. 11a, 11b and 11c, and between Figs. 15a and 15b and Figs. 16a, 16b and 16c, it can be seen that the region of interest extracted by the present invention fuses static-image-domain visual attention, motion visual attention and depth visual attention, effectively suppresses the singleness and inaccuracy inherent in each individual visual attention extraction, solves the noise problem caused by complex backgrounds in static-image-domain visual attention, and overcomes the inability of motion visual attention to extract regions of interest with local motion or small motion amplitude, thereby improving calculation accuracy and enhancing the stability of the algorithm, so that the region of interest can be extracted from backgrounds with complex texture and from moving environments. In addition, the region of interest obtained by the present invention not only conforms to the visual interest characteristics of the human eye for static texture video frames and for moving objects, but also conforms to the depth perception characteristic of stereoscopic vision, whereby objects with a strong sense of depth or at close distance attract interest, i.e. to the semantic features of human stereoscopic vision.
⑥ Repeat steps ① to ⑤ until all texture video frames of the texture video have been processed, obtaining the video region of interest of the texture video.
In this embodiment, the distribution map S_I of static-image-domain visual attention of the current texture video frame, the distribution map S_M of motion visual attention of the current texture video frame, the distribution map S_D of depth visual attention of the three-dimensional video image, and the distribution map S of three-dimensional visual attention are all grayscale images represented with a bit depth of Z_S, and the depth video frames at each time instant of the depth video corresponding to the texture video are grayscale images represented with a bit depth of Z_D. Here all the grayscale images use 256 levels, represented with 8-bit depth, so Z_S = 8 and Z_D = 8. Of course, other bit depths could also be used to represent the grayscale images in practical applications, for example 16-bit depth; representing the grayscale images with 16-bit depth would give somewhat higher precision.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2009101525203A CN101651772B (en) | 2009-09-11 | 2009-09-11 | Method for extracting video interested region based on visual attention |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2009101525203A CN101651772B (en) | 2009-09-11 | 2009-09-11 | Method for extracting video interested region based on visual attention |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN101651772A true CN101651772A (en) | 2010-02-17 |
| CN101651772B CN101651772B (en) | 2011-03-16 |
Family
ID=41673862
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN2009101525203A Expired - Fee Related CN101651772B (en) | 2009-09-11 | 2009-09-11 | Method for extracting video interested region based on visual attention |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN101651772B (en) |
Cited By (30)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101853513A (en) * | 2010-06-06 | 2010-10-06 | 华中科技大学 | A Spatiotemporal Saliency Visual Attention Method Based on Information Entropy |
| CN101894371A (en) * | 2010-07-19 | 2010-11-24 | 华中科技大学 | A bioinspired top-down approach to visual attention |
| CN101950362A (en) * | 2010-09-14 | 2011-01-19 | 武汉大学 | Analytical system for attention of video signal |
| CN101964911A (en) * | 2010-10-09 | 2011-02-02 | 浙江大学 | Ground power unit (GPU)-based video layering method |
| CN102036073A (en) * | 2010-12-21 | 2011-04-27 | 西安交通大学 | Method for encoding and decoding JPEG2000 image based on vision potential attention target area |
| CN102034267A (en) * | 2010-11-30 | 2011-04-27 | 中国科学院自动化研究所 | Three-dimensional reconstruction method of target based on attention |
| CN102063623A (en) * | 2010-12-28 | 2011-05-18 | 中南大学 | Method for extracting image region of interest by combining bottom-up and top-down ways |
| CN102496024A (en) * | 2011-11-25 | 2012-06-13 | 山东大学 | Method for detecting incident triggered by characteristic frame in intelligent monitor |
| CN102572216A (en) * | 2010-12-15 | 2012-07-11 | 佳能株式会社 | Image processing apparatus and image processing method thereof |
| CN102630025A (en) * | 2011-02-03 | 2012-08-08 | 美国博通公司 | Method and system for processing signal |
| CN102663741A (en) * | 2012-03-22 | 2012-09-12 | 北京佳泰信业技术有限公司 | Method for carrying out visual stereo perception enhancement on color digit image and system thereof |
| CN103095996A (en) * | 2013-01-25 | 2013-05-08 | 西安电子科技大学 | Multi-sensor video fusion method based on space-time conspicuousness detection |
| CN103546736A (en) * | 2012-07-12 | 2014-01-29 | 三星电子株式会社 | Image processing device and method |
| CN103797510A (en) * | 2011-07-07 | 2014-05-14 | Ati科技无限责任公司 | View Image Processing for Focus Orientation |
| CN104318569A (en) * | 2014-10-27 | 2015-01-28 | 北京工业大学 | Space salient region extraction method based on depth variation model |
| US8994792B2 (en) | 2010-08-27 | 2015-03-31 | Broadcom Corporation | Method and system for creating a 3D video from a monoscopic 2D video and corresponding depth information |
| WO2015188666A1 (en) * | 2014-06-13 | 2015-12-17 | 华为技术有限公司 | Three-dimensional video filtering method and device |
| CN105550685A (en) * | 2015-12-11 | 2016-05-04 | 哈尔滨工业大学 | Visual attention mechanism based region-of-interest extraction method for large-format remote sensing image |
| CN105893999A (en) * | 2016-03-31 | 2016-08-24 | 北京奇艺世纪科技有限公司 | Method and device for extracting a region of interest |
| CN108961261A (en) * | 2018-03-14 | 2018-12-07 | 中南大学 | A kind of optic disk region OCT image Hierarchical Segmentation method based on spatial continuity constraint |
| CN109754357A (en) * | 2018-01-26 | 2019-05-14 | 京东方科技集团股份有限公司 | Image processing method, processing device, and processing device |
| CN109903247A (en) * | 2019-02-22 | 2019-06-18 | 西安工程大学 | High-precision grayscale method for color images based on Gaussian color space correlation |
| CN110070538A (en) * | 2019-04-28 | 2019-07-30 | 华北电力大学(保定) | Bolt two-dimensional visual documents structured Cluster method based on form optimization depth characteristic |
| CN110110578A (en) * | 2019-02-21 | 2019-08-09 | 北京工业大学 | A kind of indoor scene semanteme marking method |
| CN110399842A (en) * | 2019-07-26 | 2019-11-01 | 北京奇艺世纪科技有限公司 | Method for processing video frequency, device, electronic equipment and computer readable storage medium |
| CN110675940A (en) * | 2019-08-01 | 2020-01-10 | 平安科技(深圳)有限公司 | Pathological image labeling method and device, computer equipment and storage medium |
| CN111723829A (en) * | 2019-03-18 | 2020-09-29 | 四川大学 | A fully convolutional object detection method based on attention mask fusion |
| CN112654546A (en) * | 2020-04-30 | 2021-04-13 | 华为技术有限公司 | Method and device for identifying object of interest of user |
| CN113572958A (en) * | 2021-07-15 | 2021-10-29 | 杭州海康威视数字技术股份有限公司 | Method and equipment for automatically triggering camera to focus |
| CN113936015A (en) * | 2021-12-17 | 2022-01-14 | 青岛美迪康数字工程有限公司 | Method and device for extracting effective region of image |
-
2009
- 2009-09-11 CN CN2009101525203A patent/CN101651772B/en not_active Expired - Fee Related
Cited By (50)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101853513A (en) * | 2010-06-06 | 2010-10-06 | 华中科技大学 | A Spatiotemporal Saliency Visual Attention Method Based on Information Entropy |
| CN101894371A (en) * | 2010-07-19 | 2010-11-24 | 华中科技大学 | A bioinspired top-down approach to visual attention |
| CN101894371B (en) * | 2010-07-19 | 2011-11-30 | 华中科技大学 | Bio-inspired top-down visual attention method |
| US8994792B2 (en) | 2010-08-27 | 2015-03-31 | Broadcom Corporation | Method and system for creating a 3D video from a monoscopic 2D video and corresponding depth information |
| CN101950362A (en) * | 2010-09-14 | 2011-01-19 | 武汉大学 | Analytical system for attention of video signal |
| CN101950362B (en) * | 2010-09-14 | 2013-01-09 | 武汉大学 | Analytical system for attention of video signal |
| CN101964911A (en) * | 2010-10-09 | 2011-02-02 | 浙江大学 | Ground power unit (GPU)-based video layering method |
| CN102034267A (en) * | 2010-11-30 | 2011-04-27 | 中国科学院自动化研究所 | Three-dimensional reconstruction method of target based on attention |
| CN102572216B (en) * | 2010-12-15 | 2015-07-22 | 佳能株式会社 | Image processing apparatus and image processing method thereof |
| CN102572216A (en) * | 2010-12-15 | 2012-07-11 | 佳能株式会社 | Image processing apparatus and image processing method thereof |
| US8699797B2 (en) | 2010-12-15 | 2014-04-15 | Canon Kabushiki Kaisha | Image processing apparatus, image processing method thereof, and computer-readable storage medium |
| CN102036073A (en) * | 2010-12-21 | 2011-04-27 | 西安交通大学 | Method for encoding and decoding JPEG2000 image based on vision potential attention target area |
| CN102036073B (en) * | 2010-12-21 | 2012-11-28 | 西安交通大学 | Method for encoding and decoding JPEG2000 image based on vision potential attention target area |
| CN102063623A (en) * | 2010-12-28 | 2011-05-18 | 中南大学 | Method for extracting image region of interest by combining bottom-up and top-down ways |
| CN102630025A (en) * | 2011-02-03 | 2012-08-08 | 美国博通公司 | Method and system for processing signal |
| CN102630025B (en) * | 2011-02-03 | 2015-10-28 | 美国博通公司 | A kind of method and system of processing signals |
| CN103797510A (en) * | 2011-07-07 | 2014-05-14 | Ati科技无限责任公司 | View Image Processing for Focus Orientation |
| CN102496024A (en) * | 2011-11-25 | 2012-06-13 | 山东大学 | Method for detecting incident triggered by characteristic frame in intelligent monitor |
| CN102663741B (en) * | 2012-03-22 | 2014-09-24 | 侯克杰 | Method for carrying out visual stereo perception enhancement on color digit image and system thereof |
| CN102663741A (en) * | 2012-03-22 | 2012-09-12 | 北京佳泰信业技术有限公司 | Method for carrying out visual stereo perception enhancement on color digit image and system thereof |
| CN103546736A (en) * | 2012-07-12 | 2014-01-29 | 三星电子株式会社 | Image processing device and method |
| US9661296B2 (en) | 2012-07-12 | 2017-05-23 | Samsung Electronics Co., Ltd. | Image processing apparatus and method |
| CN103546736B (en) * | 2012-07-12 | 2016-12-28 | 三星电子株式会社 | Image processing equipment and method |
| CN103095996A (en) * | 2013-01-25 | 2013-05-08 | 西安电子科技大学 | Multi-sensor video fusion method based on space-time conspicuousness detection |
| WO2015188666A1 (en) * | 2014-06-13 | 2015-12-17 | 华为技术有限公司 | Three-dimensional video filtering method and device |
| CN104318569A (en) * | 2014-10-27 | 2015-01-28 | 北京工业大学 | Space salient region extraction method based on depth variation model |
| CN104318569B (en) * | 2014-10-27 | 2017-02-22 | 北京工业大学 | Space salient region extraction method based on depth variation model |
| CN105550685A (en) * | 2015-12-11 | 2016-05-04 | 哈尔滨工业大学 | Visual attention mechanism based region-of-interest extraction method for large-format remote sensing image |
| CN105550685B (en) * | 2015-12-11 | 2019-01-08 | 哈尔滨工业大学 | The large format remote sensing image area-of-interest exacting method of view-based access control model attention mechanism |
| CN105893999A (en) * | 2016-03-31 | 2016-08-24 | 北京奇艺世纪科技有限公司 | Method and device for extracting a region of interest |
| CN109754357A (en) * | 2018-01-26 | 2019-05-14 | 京东方科技集团股份有限公司 | Image processing method, processing device, and processing device |
| CN109754357B (en) * | 2018-01-26 | 2021-09-21 | 京东方科技集团股份有限公司 | Image processing method, processing device and processing equipment |
| CN108961261A (en) * | 2018-03-14 | 2018-12-07 | 中南大学 | A kind of optic disk region OCT image Hierarchical Segmentation method based on spatial continuity constraint |
| CN108961261B (en) * | 2018-03-14 | 2022-02-15 | 中南大学 | Optic disk region OCT image hierarchy segmentation method based on space continuity constraint |
| CN110110578A (en) * | 2019-02-21 | 2019-08-09 | 北京工业大学 | A kind of indoor scene semanteme marking method |
| CN110110578B (en) * | 2019-02-21 | 2023-09-29 | 北京工业大学 | Indoor scene semantic annotation method |
| CN109903247A (en) * | 2019-02-22 | 2019-06-18 | 西安工程大学 | High-precision grayscale method for color images based on Gaussian color space correlation |
| CN111723829B (en) * | 2019-03-18 | 2022-05-06 | 四川大学 | A fully convolutional object detection method based on attention mask fusion |
| CN111723829A (en) * | 2019-03-18 | 2020-09-29 | 四川大学 | A fully convolutional object detection method based on attention mask fusion |
| CN110070538B (en) * | 2019-04-28 | 2022-04-15 | 华北电力大学(保定) | Two-dimensional visual structure clustering method of bolts based on morphological optimization depth features |
| CN110070538A (en) * | 2019-04-28 | 2019-07-30 | 华北电力大学(保定) | Bolt two-dimensional visual documents structured Cluster method based on form optimization depth characteristic |
| CN110399842B (en) * | 2019-07-26 | 2021-09-28 | 北京奇艺世纪科技有限公司 | Video processing method and device, electronic equipment and computer readable storage medium |
| CN110399842A (en) * | 2019-07-26 | 2019-11-01 | 北京奇艺世纪科技有限公司 | Method for processing video frequency, device, electronic equipment and computer readable storage medium |
| CN110675940A (en) * | 2019-08-01 | 2020-01-10 | 平安科技(深圳)有限公司 | Pathological image labeling method and device, computer equipment and storage medium |
| CN112654546A (en) * | 2020-04-30 | 2021-04-13 | 华为技术有限公司 | Method and device for identifying object of interest of user |
| WO2021217575A1 (en) * | 2020-04-30 | 2021-11-04 | 华为技术有限公司 | Identification method and identification device for object of interest of user |
| CN112654546B (en) * | 2020-04-30 | 2022-08-02 | 华为技术有限公司 | Identification method and identification device of object of interest to user |
| CN113572958A (en) * | 2021-07-15 | 2021-10-29 | 杭州海康威视数字技术股份有限公司 | Method and equipment for automatically triggering camera to focus |
| CN113936015B (en) * | 2021-12-17 | 2022-03-25 | 青岛美迪康数字工程有限公司 | Method and device for extracting effective region of image |
| CN113936015A (en) * | 2021-12-17 | 2022-01-14 | 青岛美迪康数字工程有限公司 | Method and device for extracting effective region of image |
Also Published As
| Publication number | Publication date |
|---|---|
| CN101651772B (en) | 2011-03-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN101651772B (en) | Method for extracting video interested region based on visual attention | |
| CN101588445B (en) | A Depth-Based Method for Extracting Regions of Interest in Video | |
| CN102741879B (en) | Method and system for generating depth map from monocular image | |
| US8488868B2 (en) | Generation of a depth map from a monoscopic color image for rendering stereoscopic still and video images | |
| CN102271254B (en) | A Preprocessing Method of Depth Image | |
| RU2587425C2 (en) | Method of producing high-quality image depth map | |
| CN101699512B (en) | Depth generating method based on background difference sectional drawing and sparse optical flow method | |
| US20120274626A1 (en) | Stereoscopic Image Generating Apparatus and Method | |
| CN101180653A (en) | Method and device for three-dimensional rendering | |
| US20110249886A1 (en) | Image converting device and three-dimensional image display device including the same | |
| CN105069808A (en) | Video image depth estimation method based on image segmentation | |
| CN106127799B (en) | A kind of visual attention detection method for 3 D video | |
| CN102203829A (en) | Method and device for generating a depth map | |
| CN103996198A (en) | Method for detecting region of interest in complicated natural environment | |
| Kuo et al. | Depth estimation from a monocular view of the outdoors | |
| CN102263979A (en) | Method and device for generating depth map for stereoscopic planar video | |
| CN105869115A (en) | Depth image super-resolution method based on kinect2.0 | |
| CN119251618A (en) | Infrared small target detection method based on wavelet guided state space model | |
| Fan et al. | Vivid-DIBR based 2D–3D image conversion system for 3D display | |
| CN102780900B (en) | Image display method of multi-person multi-view stereoscopic display | |
| CN103152569A (en) | Video ROI (region of interest) compression method based on depth information | |
| CN115188039B (en) | Depth fake video technology tracing method based on image frequency domain information | |
| CN101610422A (en) | Method for compressing three-dimensional image video sequence | |
| CN104143203A (en) | A method of image editing and dissemination | |
| Yang et al. | Depth map generation using local depth hypothesis for 2D-to-3D conversion |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant | ||
| ASS | Succession or assignment of patent right |
Owner name: SHANGHAI SILICON INTELLECTUAL PROPERTY EXCHANGE CE Free format text: FORMER OWNER: NINGBO UNIVERSITY Effective date: 20120105 |
|
| C41 | Transfer of patent application or patent right or utility model | ||
| COR | Change of bibliographic data |
Free format text: CORRECT: ADDRESS; FROM: 315211 NINGBO, ZHEJIANG PROVINCE TO: 200030 XUHUI, SHANGHAI |
|
| TR01 | Transfer of patent right |
Effective date of registration: 20120105 Address after: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1704 Patentee after: Shanghai Silicon Intellectual Property Exchange Co.,Ltd. Address before: 315211 Zhejiang Province, Ningbo Jiangbei District Fenghua Road No. 818 Patentee before: Ningbo University |
|
| ASS | Succession or assignment of patent right |
Owner name: SHANGHAI SIPAI KESI TECHNOLOGY CO., LTD. Free format text: FORMER OWNER: SHANGHAI SILICON INTELLECTUAL PROPERTY EXCHANGE CENTER CO., LTD. Effective date: 20120217 |
|
| C41 | Transfer of patent application or patent right or utility model | ||
| COR | Change of bibliographic data |
Free format text: CORRECT: ADDRESS; FROM: 200030 XUHUI, SHANGHAI TO: 201203 PUDONG NEW AREA, SHANGHAI |
|
| TR01 | Transfer of patent right |
Effective date of registration: 20120217 Address after: 201203 Shanghai Chunxiao Road No. 350 South Building Room 207 Patentee after: Shanghai spparks Technology Co.,Ltd. Address before: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1704 Patentee before: Shanghai Silicon Intellectual Property Exchange Co.,Ltd. |
|
| ASS | Succession or assignment of patent right |
Owner name: SHANGHAI GUIZHI INTELLECTUAL PROPERTY SERVICE CO., Free format text: FORMER OWNER: SHANGHAI SIPAI KESI TECHNOLOGY CO., LTD. Effective date: 20120606 |
|
| C41 | Transfer of patent application or patent right or utility model | ||
| C56 | Change in the name or address of the patentee | ||
| CP02 | Change in the address of a patent holder |
Address after: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1706 Patentee after: Shanghai spparks Technology Co.,Ltd. Address before: 201203 Shanghai Chunxiao Road No. 350 South Building Room 207 Patentee before: Shanghai spparks Technology Co.,Ltd. |
|
| TR01 | Transfer of patent right |
Effective date of registration: 20120606 Address after: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1704 Patentee after: Shanghai Guizhi Intellectual Property Service Co.,Ltd. Address before: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1706 Patentee before: Shanghai spparks Technology Co.,Ltd. |
|
| DD01 | Delivery of document by public notice |
Addressee: Shi Lingling Document name: Notification of Passing Examination on Formalities |
|
| TR01 | Transfer of patent right | ||
| TR01 | Transfer of patent right |
Effective date of registration: 20200120 Address after: 201203 block 22301-1450, building 14, No. 498, GuoShouJing Road, Pudong New Area (Shanghai) pilot Free Trade Zone, Shanghai Patentee after: Shanghai spparks Technology Co.,Ltd. Address before: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1704 Patentee before: Shanghai Guizhi Intellectual Property Service Co.,Ltd. |
|
| CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20110316 Termination date: 20200911 |
|
| CF01 | Termination of patent right due to non-payment of annual fee |