CN102665049A - Programmable visual chip-based visual image processing system

Info

Publication number: CN102665049A
Application number: CN2012100884200A
Authority: CN
Inventors: 石匆; 吴南健; 龙希田; 杨杰; 秦琦
Original assignee: Institute of Semiconductors of CAS
Current assignee: Institute of Semiconductors of CAS
Priority date: 2012-03-29
Filing date: 2012-03-29
Publication date: 2012-09-12
Anticipated expiration: 2032-03-29
Also published as: CN102665049B

Abstract

The invention discloses a visual image processing system based on a programmable vision chip, comprising an image sensor and a multi-stage parallel digital processing circuit. The image sensor mainly comprises a pixel array, an analog preprocessing circuit array, and an analog-to-digital conversion circuit array; the digital processing circuit mainly comprises a pixel-level parallel processing unit array, a row-parallel processing unit array, an on-chip artificial neural network, and a dual-core reduced instruction set (RISC) processor subsystem. The system achieves high-speed, high-quality image acquisition and multi-level parallel image processing, and can be programmed to implement a variety of high-speed intelligent vision applications. Compared with traditional image systems, it offers high speed, high integration, low power consumption, and low cost. The invention also presents an embodiment of this architecture and several high-speed intelligent visual image processing algorithms built on it, including high-speed motion detection, high-speed gesture recognition, and fast face detection; the processing speed can reach 1000 frames per second, which meets high-speed real-time processing requirements.

Description

Visual Image Processing System Based on a Programmable Vision Chip

Technical Field

The present invention relates to the technical field of programmable vision chips and image processing, and in particular to a visual image processing system based on a programmable vision chip. The system offers high speed, high integration, low power consumption, and low cost, and can be used in a variety of embedded high-speed real-time visual image processing systems to implement intelligent vision applications including high-speed target tracking, natural human-computer interaction, environmental monitoring, intelligent transportation, and robot vision.

Background Art

A traditional visual image processing system consists of a discrete camera and a general-purpose processor or digital signal processor (DSP). The camera acquires images with an image sensor and transmits the large volume of raw image data serially to the general-purpose processor or DSP for processing; because the transfer is serial, bandwidth is severely limited. In addition, software image processing on a general-purpose processor or DSP usually handles pixels one at a time, which creates a serial processing bottleneck. Owing to these two limitations, traditional visual image systems generally reach only about 30 frames per second, far short of high-speed real-time requirements; some industrial control systems, for example, often require 1000 frames per second.

The vision chip meets these high-speed real-time requirements effectively. Modeled on the principles of the human visual system, a vision chip integrates the image sensor and the image processing circuit on the same die. The image data acquired by the image sensor is transferred to the image processing circuit in parallel, and the image processing circuit itself uses a pixel-level massively parallel hardware architecture. The circuit finally outputs only a small amount of image feature data or analysis and recognition results. This overcomes the serial transmission and serial processing bottlenecks of traditional visual image processing systems and greatly improves real-time performance; many systems built on vision chips reach processing speeds above 1000 frames per second.

Vision chips can be divided into dedicated vision chips and programmable vision chips. The latter have greater practical value because they can be programmed to implement many applications flexibly and to cope with complex, changing real-world environments.

However, research on programmable vision chip architectures, both in China and abroad, still has serious shortcomings:

(1) Each pixel unit contains a photosensitive element, a readout circuit, and a processing circuit. The resulting large chip area severely limits resolution and fill factor, so the raw image quality is poor. Moreover, because the photosensitive element and the readout circuit are analog circuits, the processing circuit is often analog as well, which makes image processing less reliable and less flexible.

(2) These pixel units are arranged in a two-dimensional array and operate in single instruction multiple data (SIMD) mode. This supports fully pixel-parallel image acquisition and local processing, but not fast and flexible wide-area processing.

(3) Programmable vision chip architectures operating in SIMD mode support low-level image processing and part of mid-level image processing, but lack high-level image processing functions. In particular, they lack the simple, intuitive, fast feature recognition ability of the human brain. An external general-purpose processor is therefore needed to form a complete visual image system, which limits the use of vision chips in embedded applications with strict requirements on size, power consumption, and cost.

Summary of the Invention

(1) Technical Problem to Be Solved

To address the above problems of programmable vision chips, the present invention provides a visual image processing system based on a programmable vision chip in which the pixel units are separated from the processing circuits, the processing is based on multi-stage parallel digital circuits, and an artificial neural network is integrated on chip. The system achieves higher resolution and fill factor, combines local and wide-area processing, supports flexible and fast low-, mid-, and high-level image processing as well as on-chip feedback control, and thus forms a functionally complete on-chip vision system. For a variety of typical high-speed intelligent vision application algorithms, its processing speed can reach 1000 frames per second.

(2) Technical Solution

To achieve the above object, the present invention provides a visual image processing system based on a programmable vision chip, comprising:

an image sensor, used to acquire raw image data at high speed and to transfer the acquired raw image data in parallel to a multi-stage parallel digital processing circuit; and

a multi-stage parallel digital processing circuit, used to perform fast parallel processing on the raw image data received from the image sensor and to output the processing result.

In the above solution, the image sensor comprises:

an N×N pixel array 1, used to acquire raw image data at high speed and to output it to the N×1 row-parallel analog preprocessing array 3, where N is a natural number;

an N×1 row-parallel analog preprocessing array 3, used to remove fixed-pattern noise from the raw image data, to improve its dynamic range, and to output it to the N×1 row-parallel analog-to-digital conversion array 4;

an N×1 row-parallel analog-to-digital conversion array 4, used to convert each column of analog pixel data into high-precision digital pixel data and to output it to the output pixel selection module 5;

an output pixel selection module 5, used to receive in parallel the N digital pixel values of the N×1 row-parallel analog-to-digital conversion array 4 as input and to select M of them as the output of the image sensor, thereby selecting pixel rows, where M is a natural number and M < N; and

an image sensor control module 6, used to control, according to its internal parameter registers, the operating timing of the N×N pixel array 1, the N×1 row-parallel analog preprocessing array 3, the N×1 row-parallel analog-to-digital conversion array 4, and the output pixel selection module 5, thereby providing dynamic control of the image sensor.

In the above solution, the N×N pixel array 1 contains N×N pixel units 2 arranged in two dimensions, each of which contains a photosensitive element and the corresponding readout circuit. The N×1 row-parallel analog preprocessing array 3 contains N analog preprocessing units arranged in one dimension, each of which contains a correlated double sampling (CDS) circuit for removing fixed-pattern noise and a programmable gain amplifier (PGA) for improving the dynamic range. The N×1 row-parallel analog-to-digital conversion array 4 contains N analog-to-digital conversion units arranged in one dimension. The output pixel selection module 5, together with the row and column selection performed by the image sensor control module 6, gives the image sensor flexible region-of-interest processing and/or subsampling.

In the above solution, the data in the parameter registers of the image sensor control module 6 can be read and written from outside the module through the on-chip bus interface, which provides dynamic control of the image sensor.

In the above solution, the image sensor control module 6 controls the N×N pixel array 1 to perform rolling exposure and, at each step, selects one column to output N analog pixel values in row-parallel fashion to the N×1 row-parallel analog preprocessing array 3. That array removes noise and improves the dynamic range; the values then enter the N×1 row-parallel analog-to-digital conversion array 4, where they are converted in parallel into high-precision digital pixel data. Finally, the output pixel selection module 5 outputs M digital pixel values as the final output of the image sensor, which is supplied to the multi-stage parallel digital processing circuit.
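The readout chain described above (CDS, then PGA, then analog-to-digital conversion, then selection of M of the N values) can be illustrated with a minimal behavioral sketch. It is not the circuit of the invention: the function name `read_out_column`, the millivolt units, the ideal integer ADC model, and all numeric values are assumptions made for illustration only.

```python
def read_out_column(reset_mv, signal_mv, gain, adc_bits, full_scale_mv, selected):
    """Model one column readout: CDS, then PGA, then ADC, then output selection.

    All names and units here are illustrative assumptions, not taken from the patent.
    """
    max_code = (1 << adc_bits) - 1
    codes = []
    for reset, signal in zip(reset_mv, signal_mv):
        cds = reset - signal                               # CDS: the pixel's own offset cancels
        amplified = cds * gain                             # PGA: controllable gain
        clipped = min(max(amplified, 0), full_scale_mv)    # stay inside the ADC input range
        codes.append(clipped * max_code // full_scale_mv)  # ideal ADC, integer output code
    return [codes[i] for i in selected]                    # output selection: M of N values


# Two pixels with different offsets (900 mV and 950 mV) but the same light level.
same_light = read_out_column([900, 950], [700, 750], gain=2, adc_bits=10,
                             full_scale_mv=1000, selected=[0, 1])
print(same_light)
```

Because the subtraction removes each pixel's own reset level, the two pixels in the usage example produce identical codes even though their offsets differ, which is the sense in which CDS suppresses fixed-pattern noise.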

In the above solution, the multi-stage parallel digital processing circuit comprises:

an M×M pixel-level parallel processing unit array 7, used to perform local linear processing suited to pixel-level parallelism on the digital pixel data received from the image sensor and to output the result to the M×1 row processing unit array 9, where M is a natural number and M < N;

an M×1 row processing unit array 9, used to accelerate the nonlinear and wide-area processing in low- and mid-level image processing that is suited to row-parallel execution, thereby extracting image features;

a processing array control module 11, used to fetch, from its internal variable-length single instruction multiple data (SIMD) instruction memory, the control instructions for the M×M pixel-level parallel processing unit array 7 and the M×1 row processing unit array 9, and to decode them and output them to those two arrays;

an on-chip configurable artificial neural network 12, used to perform feature recognition or feature compression tasks in high-level image processing; its input is the feature vector data extracted by the M×1 row processing unit array 9 and its output is the feature recognition result;

a dual-core reduced instruction set (RISC) processor subsystem 13, used to provide thread-level parallel processing, to perform the irregular processing in high-level image processing other than regular feature recognition, and to control the entire system;

a random/sequential hybrid I/O memory 14;

system thread flags 15; and

an on-chip bus 16, used to map the read/write control signals and logical address information from the dual-core RISC processor subsystem 13 to the strobe enable signals and physical address information required by the other bus slave modules, so as to drive those slave modules to perform their operations.
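The address mapping performed by the on-chip bus 16 can be pictured as a simple decoder from a logical address to a slave select and a slave-local address. The sketch below assumes a purely hypothetical memory map: the slave names, base addresses, and sizes in `SLAVE_MAP` are invented for illustration and do not come from the patent.

```python
# Hypothetical memory map: (slave name, base logical address, size in words).
SLAVE_MAP = [
    ("image_sensor_ctrl", 0x0000, 0x0100),
    ("array_ctrl",        0x0100, 0x0100),
    ("neural_network",    0x0200, 0x0400),
    ("io_memory",         0x0600, 0x0A00),
]


def decode_address(logical):
    """Return (selected slave, physical address inside that slave)."""
    for name, base, size in SLAVE_MAP:
        if base <= logical < base + size:
            return name, logical - base    # strobe this slave, pass the local offset
    raise ValueError(f"no bus slave mapped at logical address {logical:#06x}")


print(decode_address(0x0104))
```

Exactly one slave is selected per access in this model, and an unmapped address is reported as an error rather than silently ignored; real bus fabrics may handle that case differently.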

In the above solution, the M×M pixel-level parallel processing unit array 7 contains M×M pixel-level parallel processing units PE 8 arranged in two dimensions. All PE 8 units operate in single instruction multiple data (SIMD) mode: they receive the same PE array control instruction and perform the same operation, but each operates on data from its own local memory.

In the above solution, each pixel-level parallel processing unit PE 8 corresponds to one or more image pixels of the N×N pixel array 1 in a frame. When each PE 8 corresponds to one pixel, then because M < N the whole processing unit array corresponds either to an M×M sub-region of the N×N pixel array 1 or to an M×M subsampled image of the entire N×N pixel array 1; in this case the M×M pixel-level parallel processing unit array 7 processes one frame of the M×M sub-image or subsampled image in a fully parallel manner. When each PE 8 corresponds to multiple pixels, the M×M pixel-level parallel processing unit array 7 corresponds to the entire N×N pixel array 1 or to a sub-region of it larger than M×M; in this case the whole frame is processed in a partially pixel-parallel manner.

In the above solution, the visual image processing system dynamically switches the correspondence between the pixel-level parallel processing units PE 8 and the image pixels through the image sensor control module 6, thereby achieving multi-resolution visual image processing.
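The three correspondences between PE units and pixels described above (an M×M sub-region, an M×M subsampled image, and several pixels per PE) can be sketched as index mappings. This is an illustration under stated assumptions: the function names are hypothetical, frames are plain nested lists, N is taken to be a multiple of M, and the multi-pixel mode is modeled as square blocks, which the text does not prescribe.

```python
def subregion(frame, m, top, left):
    """One pixel per PE: the PE array covers an m x m window of the frame."""
    return [row[left:left + m] for row in frame[top:top + m]]


def subsample(frame, m):
    """One pixel per PE: the PE array covers the whole frame at reduced resolution."""
    step = len(frame) // m                 # assumes N is a multiple of M
    return [[frame[r * step][c * step] for c in range(m)] for r in range(m)]


def block_of_pe(n, m, pe_row, pe_col):
    """Several pixels per PE: coordinates of the block handled by one PE (assumed square)."""
    size = n // m
    return [(pe_row * size + r, pe_col * size + c)
            for r in range(size) for c in range(size)]


frame = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4 x 4 frame, N = 4
print(subregion(frame, 2, 1, 1))
print(subsample(frame, 2))
print(block_of_pe(4, 2, 1, 0))
```

Switching between these mappings changes only which pixels reach which PE, so the same PE array program can run at different resolutions.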

In the above solution, the pixel-level parallel processing unit PE 8 performs basic 1-bit arithmetic and logic operations such as addition, NOT, AND, and OR. The multi-bit arithmetic and logic operations of low- and mid-level image processing are implemented on the PE 8 by decomposing them into these basic 1-bit operations. A PE 8 can exchange data with its upper, lower, left, and right neighboring processing units; through repeated neighbor-to-neighbor transfers, each PE 8 can interact with a processing unit at any other position.

In the above solution, the pixel-level parallel processing unit PE 8 comprises a first operand selector 31, a second operand selector 32, a 1-bit arithmetic logic unit (ALU) 33, a 1-bit temporary register 34, and a bit-plane random access memory 35. According to the control instruction output by the processing array control module 11, the first operand selector 31 selects one of the outputs of the bit-plane random access memory 35 of this unit or of the neighboring processing units as the first operand of the 1-bit ALU 33. According to the same control instruction, the second operand selector 32 selects either the output of this unit's 1-bit temporary register 34 or one of the 1-bit immediate values 0 and 1 as the second operand of the 1-bit ALU 33.

In the above solution, the 1-bit ALU 33 comprises a full adder, a NOT gate, a two-input AND gate, a two-input OR gate, a carry register, and an output result selector. The carry register holds the carry produced by an addition, which is used in multi-bit arithmetic, and can be cleared by the control instruction output by the processing array control module 11. According to that control instruction, the output result selector selects one of the outputs of the full adder, NOT gate, AND gate, and OR gate as the result of the 1-bit ALU 33.

In the above solution, the bit-plane random access memory 35 is a small random access memory with a data width of 1 bit that supports simultaneous read and write. Its read and write addresses come from the control instruction output by the processing array control module 11, its write data comes from the output of the 1-bit ALU 33, and its read data serves as one of the inputs of the first operand selector of this unit or of a neighboring processing unit.

In the above solution, the control instruction output by the processing array control module 11 selects whether each result of the 1-bit ALU 33 is written to the bit-plane random access memory 35 or to the 1-bit temporary register 34; each result must be written to exactly one of the two.
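How a multi-bit addition decomposes into 1-bit operations on a single PE can be sketched as follows. The class below is a behavioral illustration, not the actual circuit: the class name, the memory depth, the least-significant-bit-first layout, and the way operand B is moved into the temporary register (modeled as an OR with the immediate 0) are all assumptions.

```python
class BitSerialPE:
    """Toy model of one PE: bit-plane RAM, 1-bit temporary register, carry register."""

    def __init__(self, depth=32):
        self.mem = [0] * depth   # bit-plane RAM, one bit per address
        self.temp = 0            # 1-bit temporary register
        self.carry = 0           # carry register of the 1-bit ALU

    def load(self, base, value, bits):
        """Store an integer as `bits` consecutive bit planes, least significant bit first."""
        for i in range(bits):
            self.mem[base + i] = (value >> i) & 1

    def read(self, base, bits):
        return sum(self.mem[base + i] << i for i in range(bits))

    def add(self, a, b, dst, bits):
        """dst = a + b, one bit per step, using only 1-bit operations."""
        self.carry = 0                                   # carry register cleared by instruction
        for i in range(bits):
            self.temp = self.mem[b + i] | 0              # step 1: B bit OR immediate 0 -> temp
            x, y, c = self.mem[a + i], self.temp, self.carry
            self.mem[dst + i] = x ^ y ^ c                # step 2: full-adder sum -> bit-plane RAM
            self.carry = (x & y) | (c & (x ^ y))         # full-adder carry -> carry register


pe = BitSerialPE()
pe.load(0, 13, 5)        # operand A in bit planes 0..4
pe.load(5, 6, 5)         # operand B in bit planes 5..9
pe.add(0, 5, 10, 5)      # result in bit planes 10..14
print(pe.read(10, 5))
```

In the real array every PE would execute these same steps at the same time on its own local bits, so the cost of an addition grows with the word width rather than with the number of pixels.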

In the above solution, the M×1 row processing unit array 9 contains M row-parallel processing units RP 10 arranged in one dimension. All RP 10 units operate in single instruction multiple data (SIMD) mode: they receive the same RP array control instruction and perform the same operation, but each operates on data from its own local registers. Each RP 10 performs k-bit arithmetic operations, including addition, subtraction, absolute value, data shifting, and magnitude comparison; an operation on data wider than k bits can be decomposed into several narrower operations, each within k bits, that are executed serially.

In the above solution, each row-parallel processing unit RP 10 corresponds to all the pixel-level parallel processing units PE 8 in the same row of the M×M pixel-level parallel processing unit array 7; the data of each PE 8 in that row can enter the RP 10 one at a time for further processing.

In the above solution, every row-parallel processing unit RP can exchange data with the RP units directly above and below it. Some RP units can additionally exchange data with the RP units S rows above and below them; these are called skip row processing units, and all other RP units are called ordinary row processing units. Across the whole row processing unit array, starting from the first row, one skip row processing unit is placed every S rows, and ordinary row processing units are placed in all remaining rows, where S is a natural number.

In the above solution, the skip row processing units can exchange data directly over long distances, without passing the data through every intermediate row-parallel processing unit RP 10 one by one, which enables fast and flexible inter-row wide-area processing.
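The benefit of the skip row processing units can be estimated by counting the minimum number of unit-to-unit transfers between two rows. The sketch below uses a breadth-first search and assumes that skip units sit at rows 0, S, 2S, and so on (zero-based), which is one reading of "every S rows starting from the first row"; the function name and the example sizes are hypothetical.

```python
from collections import deque


def min_transfers(src, dst, m, s=None):
    """Fewest RP-to-RP transfers from row src to row dst in an array of m rows.

    With s=None only adjacent-row links exist; otherwise rows that are
    multiples of s are modeled as skip units with extra links to rows +/- s.
    """
    dist = {src: 0}
    queue = deque([src])
    while queue:
        row = queue.popleft()
        if row == dst:
            return dist[row]
        steps = [-1, 1]                       # every RP talks to its direct neighbours
        if s and row % s == 0:
            steps += [-s, s]                  # skip units also reach S rows away
        for step in steps:
            nxt = row + step
            if 0 <= nxt < m and nxt not in dist:
                dist[nxt] = dist[row] + 1
                queue.append(nxt)
    return None


print(min_transfers(0, 63, 64))        # neighbour links only
print(min_transfers(0, 63, 64, s=8))   # with skip units every 8 rows
```

In this model, a transfer across the whole array needs far fewer steps once skip links are available, which is the effect the text attributes to the skip row processing units.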

In the above solution, the row-parallel processing unit RP 10 comprises:

a k-bit buffer shift register 41, used to perform serial-to-parallel and parallel-to-serial data conversion with the M×M pixel-level parallel processing unit array 7 and to serve as the data access interface through which the on-chip bus outside the array accesses the M×1 row processing unit array 9; it can also be updated with data read from the register file of the RP unit it belongs to;

a k-bit first operand selector 42, used to select, according to the control instruction output by the processing array control module 11, one of the register file outputs of this unit or of the adjacent row processing units, or the output of this unit's buffer shift register, as the first operand of the k-bit arithmetic unit 44;

a k-bit second operand selector 43, used to select, according to the control instruction output by the processing array control module 11, either the output of this unit's temporary register or an immediate value from the array control instruction as the second operand of the k-bit arithmetic unit 44;

a k-bit arithmetic unit 44, used to perform wide-area processing and nonlinear processing, the wide-area processing including k-bit addition, subtraction, absolute value, data shifting, and magnitude comparison;

a condition selector 45, used to select, according to the control instruction output by the processing array control module 11, one of the 1-bit data output by the pixel-level parallel processing units PE 8 in this unit's row, the condition flag register of the k-bit arithmetic unit 44, and the 1-bit constant 1 as the conditional operation enable signal, which enables the k-bit tri-state buffer gate 46;

a k-bit tri-state buffer gate 46, used to receive the output of the k-bit arithmetic unit 44 and, under control of the condition enable signal output by the condition selector 45, to determine whether the result of the current operation is written to the k-bit temporary register 47 or the k-bit-wide register file 48, thereby implementing conditional operations; and

a k-bit temporary register 47 and a k-bit-wide register file 48.

In the above solution, the k-bit buffer shift register 41 can shift left or right bit by bit under an array control instruction, to perform serial-to-parallel and parallel-to-serial data conversion with the M×M pixel-level parallel processing unit array 7. Under control of a signal from outside the array, it can also shift all of its bits in parallel up or down to the buffer shift registers of the units above and below this RP 10, which gives the on-chip bus outside the array data access to the M×1 row processing unit array 9. The output of the k-bit buffer shift register 41 is one of the inputs of the k-bit first operand selector 42, and its value can also be updated with data read from the register file.
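The serial-to-parallel and parallel-to-serial conversion performed by the buffer shift register can be sketched with two small functions. The bit order (least significant bit first) and the shift direction are assumptions chosen for illustration; the text only states that the register shifts left or right bit by bit.

```python
def serial_to_parallel(bits_lsb_first, k):
    """Shift k serial bits from a PE into a k-bit register, one bit per step."""
    reg = 0
    for bit in bits_lsb_first:
        reg = (reg >> 1) | (bit << (k - 1))   # shift right, new bit enters at the top
    return reg


def parallel_to_serial(value, k):
    """Shift a k-bit register out towards a PE, least significant bit first."""
    out = []
    for _ in range(k):
        out.append(value & 1)                 # the bit leaving the register
        value >>= 1
    return out


bits = parallel_to_serial(13, 4)
print(bits, serial_to_parallel(bits, 4))
```

After k shifts the first bit received has travelled to bit position 0, so a value written out serially and read back in the same order is recovered unchanged.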

In the above solution, when the k-bit first operand selector 42 selects, according to the control instruction, among the register file outputs of this unit or the adjacent row processing units and the output of this unit's buffer shift register, and this unit is a skip row processing unit, its selection range also includes the skip row processing units S rows away from it.

In the above solution, the k-bit arithmetic unit 44 also updates its internal "carry/borrow" and "result is zero" flag registers after each operation, which supports operations on data wider than k bits as well as conditional operations. These flag registers can be cleared by the control instruction output by the processing array control module 11.
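The role of the carry flag in operations wider than k bits can be shown with a multi-word addition that processes one k-bit chunk per step. This is a generic illustration of the technique, not the instruction sequence of the invention; the function name and the parameter `words` are hypothetical.

```python
def wide_add(a, b, k, words):
    """Add two (k * words)-bit values using only k-bit additions plus a carry flag."""
    mask = (1 << k) - 1
    carry = 0                                   # carry flag, cleared before the sequence
    result = 0
    for i in range(words):                      # lowest chunk first
        chunk_a = (a >> (i * k)) & mask
        chunk_b = (b >> (i * k)) & mask
        total = chunk_a + chunk_b + carry       # one k-bit add with carry-in
        result |= (total & mask) << (i * k)     # keep the low k bits as this chunk's result
        carry = total >> k                      # carry-out feeds the next chunk
    return result, carry


print(wide_add(0x12FF, 0x0001, k=8, words=2))
```

The final carry value reports overflow of the full-width result, which mirrors how a flag register can later be used as a condition.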

In the above solution, the k-bit-wide register file 48 is a small random access memory or register bank with a data width of k bits that supports simultaneous read and write. Its read and write addresses come from the control instruction output by the processing array control module 11, its write data comes from the output of the k-bit tri-state buffer gate 46, and its read data serves as one of the inputs of the k-bit first operand selector 42 of this unit or of the adjacent row processing units; if this unit is a skip row processing unit, the read data also goes to the skip row processing units S rows away from it.

In the above solution, the control instruction output by the processing array control module 11 selects whether each result of the k-bit arithmetic unit 44 is written to the k-bit temporary register 47 or to the k-bit-wide register file 48; when the k-bit tri-state buffer gate 46 is enabled, the result must be written to exactly one of the two.

In the above solution, the condition selector 45 can take the 1-bit data coming directly from the pixel-level parallel processing unit PE 8 as the condition enable signal, without serial-to-parallel conversion through the k-bit buffer shift register 41, which facilitates flexible and fast intra-row wide-area processing.
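One way to picture the conditional operation is a row-wise count of set pixels: every RP computes "register plus one" at each step, and the 1-bit output of the current PE decides whether that result is written back. The sketch below is an assumed example application, not one named in the text, and the loop over rows stands in for units that would act simultaneously in hardware.

```python
def row_counts(binary_image):
    """Count the 1-pixels in each row using a conditionally enabled write-back."""
    counts = [0] * len(binary_image)              # one result register per RP
    width = len(binary_image[0])
    for col in range(width):                      # PE bits arrive one column per step
        for row, pixels in enumerate(binary_image):   # all RPs act in lockstep (SIMD)
            enable = pixels[col]                  # condition selector picks the PE's 1-bit data
            result = counts[row] + 1              # arithmetic unit always computes register + 1
            if enable:                            # tri-state gate passes the result only if enabled
                counts[row] = result
    return counts


print(row_counts([[1, 0, 1],
                  [0, 0, 0],
                  [1, 1, 1]]))
```

Because the pixel bit is used directly as the enable signal, no k-bit word has to be assembled from the PE data before the RP can act on it.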

In the above solution, when the M×1 row processing unit array 9 runs a more complex algorithm and the register file does not provide enough storage, data can be stored into the M×M pixel-level parallel processing unit array 7 through the k-bit buffer shift register 41. When all operations of the M×1 row processing unit array 9 are complete, the result data can be written to the k-bit buffer shift register 41 and then read out over the on-chip bus 16 outside the array.

In the above solution, the position from which the processing array control module 11 reads an instruction segment in the variable-length SIMD instruction memory is configured dynamically over the on-chip bus 16; when execution of that instruction segment completes, a completion flag is generated and reported to the on-chip bus 16.

In the above solution, in order both to support cooperative operation of the M×M pixel-level parallel processing unit array 7 and the M×1 row processing unit array 9 and to reduce the on-chip instruction storage required, the visual image processing system adopts a variable-length SIMD instruction mechanism. Each address of the variable-length SIMD instruction memory stores one 2L-bit instruction word. The header of the instruction word distinguishes whether the word is a single 2L-bit extra-long SIMD instruction that controls the M×M pixel-level parallel processing unit array 7 and the M×1 row processing unit array 9 working cooperatively, or two L-bit ordinary SIMD instructions that control the two arrays working separately. A scheduling and decoding functional unit for variable-length SIMD instructions is embedded in the processing array control module 11.
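The variable-length mechanism can be sketched as a decoder that inspects the header of each 2L-bit word. The text does not specify the header format, so the encoding below is entirely hypothetical: L is set to 16, the header is assumed to be the most significant bit of the word, and a set bit marks an extra-long instruction.

```python
L = 16  # hypothetical width of an ordinary SIMD instruction, in bits


def decode(word):
    """Split one 2L-bit instruction word according to its (assumed) header bit."""
    if not 0 <= word < (1 << (2 * L)):
        raise ValueError("instruction word does not fit in 2L bits")
    if word >> (2 * L - 1):                      # header bit set: one extra-long instruction
        return [("long", word)]
    high = word >> L                             # header bit clear: two ordinary instructions
    low = word & ((1 << L) - 1)
    return [("short", high), ("short", low)]


print(decode(0x80000001))
print(decode(0x12345678))
```

Packing two ordinary instructions into one memory word is what saves instruction storage: only programs that need cooperative control of both arrays pay for a full 2L-bit instruction.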

In the above solution, the on-chip configurable artificial neural network 12 includes: an input neuron vector register group 51 comprising T1 input neuron registers, each of which stores J1-bit fixed-point data, where T1 ≪ M; a neuron broadcaster 52, which receives the data of the input neuron vector register group 51 and each time selects one item to broadcast to the parallel operation unit array 53 as one of the operands of every parallel operation unit in that array; a parallel operation unit array 53 comprising T2 parallel operation units, T2 ≤ T1, each of which receives the input neuron broadcast by the neuron broadcaster 52 as its first operand and receives its own one of the T2 weight/threshold data items stored at each address of the weight/threshold memory 55 as its second operand, where the weights/thresholds are J-bit fixed-point data, J > J1; an output neuron vector register group 54 comprising T2 output neuron registers, each of which stores J2-bit fixed-point data; a weight/threshold memory 55, which holds the weight and threshold data required by the computation, with T2 J-bit fixed-point data items at each address; a neural network control module 56, which controls the parallel computation of the whole on-chip configurable artificial neural network 12 according to the configured parameters and supplies the memory address while the network is operating normally; and a bus read/write interface 57, through which the data in the input neuron vector register group 51, the output neuron vector register group 54 and the weight/threshold memory 55 can be written and read from outside. The mapping function of the piecewise linear mapping unit in each parallel operation unit and the control parameters of the neural network control module 56 are also configured through the bus read/write interface 57.

In the above solution, each parallel operation unit includes a fixed-point multiplier, an accumulation register and a piecewise linear mapping unit. The fixed-point multiplier and the accumulation register perform the multiply-accumulate operation on the input neuron data and the corresponding weights/thresholds; the accumulation register can be cleared by the neural network control module; and the piecewise linear mapping unit implements the activation (transfer) function, with its output used to update the output neuron vector register group 54.

In the above solution, under the control of the neural network control module 56, the neuron broadcaster 52 broadcasts one input neuron at a time to the parallel operation unit array 53, while the weight/threshold data corresponding to the broadcast input neuron are fetched from the weight/threshold memory 55 into the parallel operation unit array 53. In each parallel operation unit the two are multiplied and the product is added to the accumulation register. After all input neurons have been processed, the piecewise linear mapping is applied in parallel, and the final result is normalized to T2 bits and written into the output neuron vector register group 54.

In the above solution, the data written into the weight/threshold memory 55 and the data used to configure the parallel operation units and the neural network control module 56 are obtained from the results of training the neural network; the training is carried out on the RISC dual-core subsystem 13 or on a general-purpose processor outside the system.

In the above solution, the on-chip configurable artificial neural network 12 supports at most T1 input neurons and at most T2 output neurons, with T2 ≤ T1. When the number of input neurons is less than T1, or the number of output neurons is less than T2, the corresponding data in the remaining input neuron registers, output neuron registers and weight/threshold memory are automatically set to 0.

In the above solution, the data of the output neuron registers are read out by the on-chip bus 16 and written back into the input neuron registers, so that a multi-layer neural network can be computed.

In the above solution, the RISC dual-core subsystem 13 includes RISC core No. 1 (RISC#1), a private program/data memory for RISC#1, RISC core No. 2 (RISC#2), a private program/data memory for RISC#2, an inter-core communication mailbox and a processor arbiter. RISC#1 and RISC#2 each have a private program/data memory with a P-bit data width, which enables thread-level parallel processing. The subsystem is responsible for the irregular processing in high-level image processing other than regular feature recognition, and for controlling the whole system.

In the above solution, RISC#1 and RISC#2 communicate through the inter-core communication mailbox to achieve the necessary thread synchronization and data exchange. Access by RISC#1 and RISC#2 to the on-chip bus is controlled by the processor arbiter, which supports two arbitration schemes in hardware: fixed priority and first-come-first-served. The inter-core communication mailbox is a synchronous bidirectional FIFO.

In the above solution, the RISC dual-core subsystem 13 also dynamically adjusts the data in the parameter registers of the image sensor control module 6 according to the global image information or the region of the target of interest obtained by the M×M pixel-level parallel processing unit array 7 and the M×1 row processing unit array 9. This lets the system adapt to a changing application environment and meet the multi-resolution processing requirements that arise from relative motion of the system or the target within the environment.

In the above solution, the random/sequential mixed I/O memory 14 is a dual-port memory. One port is P bits wide and supports random read/write access from the on-chip bus; the other port is PS bits wide (PS < P) and supports sequential read/write access from off-chip devices, with reads and writes independent of each other. The enable signals of off-chip sequential reads and writes are automatically mapped to physical addresses of the memory by an address generation module embedded in the memory, and this physical address can be reset to zero by an external redirect signal.

In the above solution, the system thread flag 15 is a W-bit register. Writes to some of its bits are controlled by the on-chip bus 16 inside the system, while writes to the other bits are controlled by devices outside the system; all bits of the flag register can be read from both inside and outside the system.
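The split write ownership of the flag register can be illustrated with a small behavioural model. This is only a sketch: the class name `ThreadFlagRegister`, the default width of 8 and the choice of which bits belong to the internal side (`internal_mask`) are assumptions made for illustration, since the text does not specify W or the bit assignment.

```python
class ThreadFlagRegister:
    """Behavioural model of a W-bit flag register with split write ownership.

    The bit assignment (internal_mask) is hypothetical; the source only says
    that some bits are written from inside the system and the rest from outside.
    """

    def __init__(self, width=8, internal_mask=0x0F):
        self.full_mask = (1 << width) - 1
        self.internal_mask = internal_mask & self.full_mask
        self.external_mask = self.full_mask & ~self.internal_mask
        self.value = 0

    def _write(self, data, mask):
        # Only the bits owned by the writer change; all other bits are kept.
        self.value = (self.value & ~mask) | (data & mask)

    def write_internal(self, data):
        """Write issued over the on-chip bus."""
        self._write(data, self.internal_mask)

    def write_external(self, data):
        """Write issued by an off-chip device."""
        self._write(data, self.external_mask)

    def read(self):
        """Both sides can read every bit."""
        return self.value


flags = ThreadFlagRegister(width=8, internal_mask=0x0F)
flags.write_internal(0xFF)   # only the low four bits are accepted
flags.write_external(0xA5)   # only the high four bits are accepted
```

With disjoint write masks, neither side can overwrite the other's flags, so a simple handshake between an on-chip thread and an off-chip host needs no further locking in this model.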

In the above solution, the on-chip bus 16 maps the read/write control signals and logical address information from the RISC dual-core subsystem 13 to the strobe enable signals and physical address information required by each of the other bus slave modules. These slave modules include the image sensor control module, the processing array control module, the on-chip artificial neural network, the random/sequential mixed I/O memory and the system thread flag.

In the above solution, the digital pixel data obtained by the image sensor are loaded into the M×M pixel-level parallel processing unit array 7 in a row-parallel manner. The M×M pixel-level parallel processing unit array 7 and the M×1 row processing unit array 9 cooperate to carry out various low- and mid-level image processing tasks flexibly and to extract image features. The features are sent to the on-chip artificial neural network 12 for feature recognition and are further analysed by the RISC dual-core subsystem 13, which produces and outputs the small amount of result data finally required.

(3) Beneficial effects

As can be seen from the above technical solutions, the present invention has the following beneficial effects:

1. In the visual image processing system based on a programmable visual chip provided by the present invention, the pixel array is separated from the processing circuits. This completely resolves the problems of conventional visual chips, whose area grows rapidly with resolution and whose low resolution and fill factor limit the quality of the raw image. In addition, the flexible correspondence between processing elements (PEs) and pixels helps realize flexible multi-resolution processing.

2. The visual image processing system based on a programmable visual chip provided by the present invention introduces an architecture built on multi-level parallel digital processing circuits and an on-chip artificial neural network. Through programming it can carry out low-, mid- and high-level image processing quickly and flexibly, realizing a complete visual image processing system on a single chip and extending the use of visual chips to embedded applications with strict limits on size, power consumption and cost.

3. In the visual image processing system based on a programmable visual chip provided by the present invention, the programmable row processing unit array with skip row processing units provides flexible and fast wide-area processing, which speeds up feature extraction.

4. The visual image processing system based on a programmable visual chip provided by the present invention has fast and flexible real-time visual image processing capability; the processing speed can exceed 1000 frames per second.

Description of the drawings

Fig. 1 is a schematic structural diagram of the visual image processing system based on a programmable visual chip provided by the present invention;

Fig. 2 is a circuit diagram of one row of the image sensor array in Fig. 1, including a row of high-speed four-transistor pixel units with rolling exposure, the row-parallel analog processing unit that follows it (comprising a correlated double sampling circuit and a programmable gain amplifier circuit), and the analog-to-digital conversion circuit unit based on a cyclic redundancy scheme;

Fig. 3 is a circuit structure diagram of a processing element (PE) in the pixel-level parallel processing unit array in Fig. 1;

Fig. 4 is a circuit structure diagram of a row processing unit (RP) in the row-parallel processing unit array in Fig. 1;

Fig. 5 is a circuit structure diagram of the on-chip configurable artificial neural network in Fig. 1;

Fig. 6 is a flow chart of a 1000 frames-per-second high-speed target tracking algorithm based on the visual image processing system in Fig. 1;

Fig. 7 is a flow chart of a high-speed gesture recognition algorithm based on the visual image processing system in Fig. 1;

Fig. 8 shows the binarized images of the four classes of gestures to be recognized by the algorithm in Fig. 7;

Fig. 9 is a schematic diagram of a fast face detection algorithm based on the visual image processing system in Fig. 1.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

Fig. 1 is a schematic structural diagram of the visual image processing system based on a programmable visual chip provided by the present invention. The system consists of two parts: an image sensor module and a multi-level parallel digital image processing module. The image sensor module includes an N×N pixel array, an N×1 row-parallel analog preprocessing array, an N×1 row-parallel analog-to-digital converter (ADC) array, an output pixel selection module and an image sensor control module. In this embodiment N = 256, which matches the standard resolution used in applications such as machine vision. The multi-level parallel digital image processing module includes an M×M pixel-level parallel processing element (PE) array, an M×1 row processing unit (RP) array, an on-chip configurable artificial neural network, a RISC dual-core subsystem, a random/sequential mixed I/O memory 14, a system thread flag and a high-speed on-chip bus. In this embodiment M = 64; combined with the flexible selection of the pixel-array region of interest and resolution by the output pixel selection module and the image sensor control module, this allows multi-resolution processing within a small chip area.

Fig. 2 shows one row of pixel units in this embodiment together with the analog preprocessing unit and analog-to-digital converter (ADC) of that row. The pixel unit uses a standard four-transistor pixel structure which, combined with the correlated double sampling (CDS) circuit in the analog preprocessing unit, removes reset noise and fixed-pattern noise. The programmable gain amplifier (PGA) circuit in the analog preprocessing unit can change the size of its equivalent feedback capacitance to obtain different gains, improving the dynamic range and contrast of the image. Finally, the analog image data are converted to digital signals by a small-area cyclic redundant ADC unit and passed through the output pixel selection module to the pixel-level parallel processing element (PE) array, where processing begins.

The output pixel selection module and the image sensor control module of the image sensor module can be configured dynamically through the external on-chip bus to select rows and columns of the pixel array flexibly. In other words, different regions of interest can be selected at different sub-sampling resolutions and passed to the PE array and the subsequent processing modules, which realizes multi-resolution processing. The parameter registers of the image sensor control module can also be configured dynamically through the external on-chip bus to obtain different integration (exposure) times, frame rates, PGA gains and so on, so that the operating mode of the image sensor is adjusted in real time according to the needs of the application environment and the algorithm.
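As a rough illustration of how one region-of-interest setting and one sub-sampling step map the N×N pixel array onto the M×M PE array, consider the sketch below. The function name `select_pixels` and its parameters (`row0`, `col0`, `step`) are hypothetical; the actual register layout of the output pixel selection module is not described in the text.

```python
def select_pixels(frame, row0, col0, step, m):
    """Return an m x m window of `frame`, sampled every `step` pixels.

    (row0, col0) is the top-left corner of the region of interest.
    A square frame is assumed.
    """
    n = len(frame)
    last = max(row0, col0) + (m - 1) * step
    if last >= n:
        raise ValueError("window exceeds the pixel array")
    return [[frame[row0 + i * step][col0 + j * step] for j in range(m)]
            for i in range(m)]


N, M = 256, 64
# Synthetic test frame; real data would come from the ADC array.
frame = [[(r * N + c) % 256 for c in range(N)] for r in range(N)]

# Whole field of view at one quarter of the resolution in each direction.
full_view = select_pixels(frame, 0, 0, N // M, M)
# A 64 x 64 region of interest at full resolution.
roi_view = select_pixels(frame, 96, 96, 1, M)
```

Both calls deliver exactly M×M values, which is why a 64×64 PE array can serve a 256×256 sensor: the trade-off between field of view and resolution is made at read-out time rather than in the processing array.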

Fig. 3 shows the circuit structure of a pixel-level parallel processing element (PE) in this embodiment. Each PE includes a first operand selector, a second operand selector, a 1-bit arithmetic logic unit (ALU), a 1-bit temporary data register and a bit-plane random access memory. The first operand selector selects the output of the bit-plane memory of either this PE or one of its four neighbouring PEs (up, down, left, right) as the first operand of the 1-bit ALU, which provides data exchange between neighbouring PEs. The second operand selector selects either the data in the 1-bit temporary register of this PE or the constant 0 or 1 as the second operand of the 1-bit ALU. The 1-bit ALU performs 1-bit AND, OR and NOT logic operations and 1-bit addition. More complex multi-bit arithmetic such as addition, subtraction and multiplication is carried out by decomposing it into a sequence of serial 1-bit additions, with the carry register in the ALU holding the carry flag. The bit-plane memory has 1-bit input and output data; multi-bit grayscale image data are stored in it bit by bit and therefore occupy several addresses. In this embodiment the bit-plane memory has a capacity of 64 bits, which meets the data storage needs of low- and mid-level image processing in the vast majority of applications. The result of the ALU can be stored in the temporary register or in the bit-plane memory.
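The decomposition of a multi-bit addition into serial 1-bit additions can be sketched as follows. This is a behavioural model of a single PE, not the actual micro-instruction sequence: the helper names (`store`, `load`, `bit_serial_add`), the LSB-first storage order and the memory addresses are assumptions.

```python
def store(memory, addr, value, nbits):
    """Write `value` LSB-first into a 1-bit-wide memory starting at `addr`."""
    for i in range(nbits):
        memory[addr + i] = (value >> i) & 1


def load(memory, addr, nbits):
    """Read an nbits-wide value stored LSB-first starting at `addr`."""
    return sum(memory[addr + i] << i for i in range(nbits))


def bit_serial_add(memory, a_addr, b_addr, out_addr, nbits):
    """Add two nbits-wide values using only 1-bit additions and a carry register."""
    carry = 0                          # models the ALU carry register
    for i in range(nbits):
        temp = memory[b_addr + i]      # second operand via the 1-bit temporary register
        a = memory[a_addr + i]         # first operand straight from bit-plane memory
        total = a + temp + carry       # one 1-bit full addition
        memory[out_addr + i] = total & 1
        carry = total >> 1
    memory[out_addr + nbits] = carry   # keep the final carry as the top result bit


memory = [0] * 64                      # 64-bit bit-plane memory, as in this embodiment
store(memory, 0, 200, 8)
store(memory, 8, 100, 8)
bit_serial_add(memory, 0, 8, 16, 8)
result = load(memory, 16, 9)           # 9 bits: 8 sum bits plus the carry
```

The cost grows linearly with the word length, but every PE in the array performs the same step at once, so the whole image is processed in the time one pixel would take.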

The instructions for the PE array are stored in the processing array control module. Under the control of that module the PE array works in parallel in single-instruction multiple-data (SIMD) mode and can execute one instruction per clock cycle.

The pixel-level parallel processing element (PE) array is mainly used for local linear operations in low- and mid-level image processing, including background subtraction, linear grayscale transformation, smoothing filtering, edge detection, threshold segmentation, binary morphology and so on. All of these operations can be completed at high speed with pixel-level parallelism, giving a speedup of O(M×M) over serial processing.

Fig. 4 shows the circuit structure of a row-parallel processing unit (RP) in this embodiment. Each RP includes a k-bit buffer shift register, a k-bit first operand selector, a k-bit second operand selector, a k-bit arithmetic logic unit, a condition selector, a k-bit tri-state buffer gate, a k-bit temporary register and a k-bit-wide register file. In this embodiment k = 8, because grayscale image data are generally 8 bits wide, so k = 8 gives a good balance between performance and area. The buffer shift registers of the RPs together form a shift register array. It supports word-parallel, bit-serial left/right shifts for data exchange between each RP and the PEs in the same row, and word-serial, bit-parallel up/down shifts for data reads and writes between the RP array and the bus outside the array.

RPs are divided into skip RPs and ordinary RPs. Every RP can exchange data with its neighbouring (upper and lower) RPs through the data input selected by the first operand selector. The first operand selector of a skip RP can additionally select data from the RP that is S rows away, which provides a "skip" exchange. One skip RP is placed every S rows in the RP array, and all of these RPs form a skip chain that accelerates certain statistical wide-area operations. In this embodiment S = 8, because a simple theoretical derivation shows that the optimal value of S is the square root of M. The second operand of the RP ALU comes from the temporary register of the same RP or from the immediate field of the RP array instruction. The ALU implements maximum/minimum, addition/subtraction, data shift and absolute value on 8-bit data in hardware, and generates flag bits for conditional operations in the next cycle. The condition selector selects either the ALU flag bit or 1-bit data from the PEs of the same row (usually binary image data or flag data) to control the tri-state buffer gate and thereby make writes conditional; this is how conditional operations of the RP are realized. The temporary register and the register file store data during RP processing. Because the output of the RP array is generally no longer image data but features of the image, the required storage is small. When an RP does need more storage, it can share the bit-plane memory of the PEs in its row through the serial/parallel conversion of the buffer shift register. In this embodiment the RP register file has a capacity of 8 bits × 16.
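The text does not give the derivation behind S = √M. One plausible reading, offered here only as an assumed cost model, is that a column-wide reduction first combines the S rows inside each group through neighbour links (S − 1 steps) and then combines the M/S group results along the skip chain (M/S − 1 steps). The total S + M/S − 2 is smallest when S = √M. The sketch below evaluates that model; the function names are hypothetical.

```python
def reduction_steps(m, s):
    """Steps to combine m row values with one skip unit every s rows (assumed model)."""
    within_group = s - 1        # neighbour-to-neighbour passes inside a group of s rows
    along_chain = m // s - 1    # hops between skip units that are s rows apart
    return within_group + along_chain


def best_skip(m):
    """Return the divisor of m that minimises the modelled step count."""
    divisors = [s for s in range(1, m + 1) if m % s == 0]
    return min(divisors, key=lambda s: reduction_steps(m, s))


M = 64
costs = {s: reduction_steps(M, s) for s in (1, 2, 4, 8, 16, 32, 64)}
optimal_s = best_skip(M)
```

Under this model the step count for M = 64 drops from 63 without skip links (S = 1) to 14 at S = 8, and rises again for larger S because the within-group passes start to dominate.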

The RPs are mainly used for wide-area and nonlinear operations in low- and mid-level image processing that are well suited to row-parallel execution, including median filtering, grayscale morphology, average gray level calculation, shape feature extraction (such as area, perimeter and the rectangular bounding box of the target region) and so on. The main steps of these operations can be carried out in a row-parallel manner, giving a speedup of O(M) over serial processing.

The RP array also works in single-instruction multiple-data (SIMD) mode. Its instructions likewise come from the processing array control module, and it can execute one instruction per clock cycle. RP array operations often need the cooperation of the PE array. To support cooperative operation of the PE array and the RP array without wasting instruction storage when only one of the arrays needs to work, a variable-length very long SIMD instruction word (Variable VLIW SIMD, VVS) mechanism is used. The instruction header distinguishes whether a word is a single 2L-bit very long SIMD instruction word that controls the PE array and the RP array working together, or two consecutive L-bit ordinary SIMD instruction words that each control the PE array or the RP array. The processing array control module fetches instructions under the control of parameters written over the external bus, and is responsible for decoding and scheduling the instructions. In this embodiment L = 32 and the on-chip instruction storage is 8 KB, which meets the storage needs of low- and mid-level image processing algorithms in the vast majority of typical applications.
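A minimal decoding sketch of the VVS idea follows. The header encoding is not specified in the text, so the sketch assumes that the most significant bit of the 2L-bit word is the header (1 for a joint instruction, 0 for two packed instructions) and that the high half is issued before the low half; these choices and the names `decode_word` and `decode_program` are hypothetical.

```python
L = 32
WORD_BITS = 2 * L
HEADER_JOINT = 1   # assumed encoding: MSB = 1 marks one 2L-bit joint PE+RP instruction


def decode_word(word):
    """Split one 2L-bit memory word into the instruction(s) it carries."""
    if word >> WORD_BITS:
        raise ValueError("word wider than 2L bits")
    header = word >> (WORD_BITS - 1)
    if header == HEADER_JOINT:
        return [("joint", word)]               # both arrays driven in the same cycle
    high = word >> L
    low = word & ((1 << L) - 1)
    return [("single", high), ("single", low)]  # two L-bit instructions, issued in turn


def decode_program(words):
    """Decode a list of memory words into the issued instruction stream."""
    issued = []
    for word in words:
        issued.extend(decode_word(word))
    return issued


program = [
    (1 << 63) | 0x1234,                 # joint instruction
    (0x0000ABCD << 32) | 0x00005678,    # two single-array instructions
]
issued = decode_program(program)
```

Packing two single-array instructions into one address is what saves instruction memory: a program section that drives only one array occupies half the addresses it would need if every instruction were stored at full 2L-bit width.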

Fig. 5 shows the structure of the on-chip configurable artificial neural network in this embodiment. The network includes an input neuron register group with J1-bit precision, a neuron broadcaster, a parallel operation unit array, an output neuron register group with J2-bit precision, a weight/threshold memory with J-bit precision and a neural network control module. Each parallel operation unit in turn includes a fixed-point multiplier, an accumulation register and a piecewise linear mapping unit.

The image feature data extracted by the RP array are loaded into the T1 input neuron registers over the on-chip bus. Under the control of the neural network control module, the neuron broadcaster broadcasts the data of each input neuron register in turn to the parallel operation unit array as one operand. The other operand comes from the weight/threshold memory: each address of this memory stores T2 weight/threshold data items, one for each of the T2 parallel operation units, and the memory address is supplied by the neural network control module. Each parallel operation unit multiplies the input neuron data by the corresponding weight and adds the product to its accumulation register. After all T1 input neuron registers have been broadcast and processed, the threshold stored in the weight/threshold memory is subtracted, and the result is passed to the piecewise linear mapping unit, which implements the transfer function. The outcome is the value of the output neuron register, which represents the recognition result and is finally read out over the on-chip bus. The artificial neural network carries out the feature recognition task with vector-level parallelism and achieves a speedup of about O(T2) over serial processing.
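The broadcast, multiply-accumulate, threshold and mapping sequence can be modelled in a few lines. This sketch makes several assumptions that are not in the text: the thresholds are stored at the address following the last weight row, the transfer function is a saturating ramp between two knee points `lo` and `hi`, and plain Python integers stand in for the fixed-point formats. The names `ann_layer` and `piecewise_linear` are hypothetical.

```python
def piecewise_linear(x, lo, hi, out_max):
    """Saturating ramp: 0 below lo, out_max above hi, linear in between (assumed shape)."""
    if x <= lo:
        return 0
    if x >= hi:
        return out_max
    return (x - lo) * out_max // (hi - lo)


def ann_layer(inputs, memory, lo, hi, out_max=255):
    """One pass of the network.

    memory[i] holds the T2 weights for input neuron i; memory[len(inputs)]
    holds the T2 thresholds (assumed layout).
    """
    t2 = len(memory[0])
    acc = [0] * t2                      # accumulation registers, cleared before the pass
    for i, x in enumerate(inputs):      # one input neuron is broadcast per step
        weights = memory[i]             # a single address feeds all T2 units
        for u in range(t2):             # executed in parallel in hardware
            acc[u] += x * weights[u]
    thresholds = memory[len(inputs)]
    return [piecewise_linear(acc[u] - thresholds[u], lo, hi, out_max)
            for u in range(t2)]


inputs = [10, 20]
memory = [[1, 2], [3, -1], [0, 5]]      # two weight rows followed by one threshold row
outputs = ann_layer(inputs, memory, lo=0, hi=100)
```

The outer loop takes one step per input neuron while the inner loop collapses to a single hardware step, which is where the roughly O(T2) speedup comes from. Calling `ann_layer` again with the previous `outputs` as `inputs` and a different weight region would model the multi-layer use in which output registers are fed back to the input registers.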

The effective number of input neurons (the dimension of the image feature vector) may be less than T1, and likewise the effective number of output neurons (the number of classes in target recognition) may be less than T2. In either case the computation can be shortened by configuring the parameter registers in the neural network control module so that the remaining, invalid neuron data take no part in the computation, which speeds up processing. The two knee points of the piecewise linear mapping function are also configurable. In addition, the data in the output neuron registers at the end of one pass can be read out and fed back as the data of the input neuron registers before the next pass, so that a neural network with any number of layers can be realized dynamically to handle complex recognition tasks. In short, the on-chip artificial neural network is highly configurable.

The weight/threshold data of the artificial neural network are obtained by training. Training is not part of the processing flow during normal operation, so it does not affect the real-time behaviour of the system, and the training procedure itself is too complex to be implemented directly in hardware. Training can therefore be carried out on the RISC dual-core subsystem or even on a general-purpose processor outside the system, and the resulting weights and thresholds are downloaded into the weight/threshold memory of the artificial neural network afterwards. Training may use either supervised or unsupervised learning, so several kinds of artificial neural network can be realized, including back-propagation (BP) neural networks, self-organizing map (SOM) neural networks and learning vector quantization (LVQ) neural networks.

In this embodiment T1 = 16, T2 = 8, J1 = 8, J2 = 8, J = 12, and the weight/threshold memory has a capacity of 12 bits × 256. This precision and capacity meet the needs of the target recognition algorithms in most applications. When the feature dimension is greater than 16, the recognition can be completed by processing different feature subspaces in several passes.

In this embodiment, the reduced instruction processor (RISC) dual-core subsystem in FIG. 1 mainly handles the irregular, complex algorithms of high-level image processing (such as artificial neural network training, the Hough transform and principal component analysis), performs dynamic configuration, and coordinates the parallel operation of the other modules in the system. The subsystem comprises two RISC cores with a P-bit data width, each with its own private program/data memory, an inter-core communication mailbox, and a processor arbiter. Each RISC core can access its own private program/data memory directly, but must request access to any other on-chip resource through the processor arbiter, which supports first-come-first-served and fixed-priority arbitration in hardware and can be reconfigured flexibly. The inter-core mailbox is essentially a dual-port synchronous FIFO that supports thread synchronization between the two cores. The subsystem provides thread-level parallelism, which yields some speedup over single-core, single-threaded processing and shortens complex high-level processing. In this embodiment, P = 32 and the FIFO capacity is 32 bit × 16.
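As a rough behavioral model only, the inter-core mailbox and the two arbitration policies could be sketched as below. The class and function names, the non-blocking send/receive interface, and the convention that a lower core number means higher fixed priority are assumptions of this sketch, not details of the embodiment.

```python
from collections import deque


class Mailbox:
    """Behavioral model of a 16-entry FIFO of 32-bit words."""

    DEPTH = 16
    WORD_MASK = 0xFFFFFFFF

    def __init__(self):
        self._fifo = deque()

    def try_send(self, word):
        """Push one word; return False if the FIFO is full."""
        if len(self._fifo) >= self.DEPTH:
            return False
        self._fifo.append(word & self.WORD_MASK)
        return True

    def try_receive(self):
        """Pop the oldest word, or return None if the FIFO is empty."""
        if not self._fifo:
            return None
        return self._fifo.popleft()


def arbitrate(requests, policy="fixed"):
    """Pick the core to grant from a list of (arrival_time, core_id) requests.

    "fcfs" grants the earliest request; "fixed" grants the lowest core_id
    (assumed here to be the highest priority).
    """
    if not requests:
        return None
    if policy == "fcfs":
        return min(requests)[1]
    return min(requests, key=lambda r: r[1])[1]
```

One core would call `try_send` to post a synchronization token and the other would poll `try_receive`; a full or empty FIFO is reported to the caller rather than blocking.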

In this embodiment, the random/sequential mixed I/O memory in FIG. 1 is a dual-port memory used for data exchange between the inside and outside of the system. To keep the pin count as low as possible, one port is P bits wide and supports random read/write access from the on-chip bus, while the other port is PS bits wide (PS < P) and supports sequential read/write access from off-chip devices, with reads and writes independent of each other. The enable signals for off-chip sequential reads and writes are automatically mapped to physical addresses of the memory by an address generation module embedded in the memory, and this physical address can be reset to zero by an external redirect signal. The two ports, facing the inside and the outside of the system, can run at different clock frequencies, which broadens the range of applications of the system. In this embodiment, P = 32 and PS = 8.
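The behavior of such a memory can be illustrated with the following sketch for P = 32 and PS = 8. The little-endian byte order, the separate read and write address counters, the wrap-around at the end of the memory, and all names are assumptions of the sketch; the embodiment does not specify them.

```python
class MixedIOMemory:
    """Model of a memory with a 32-bit random port and an 8-bit sequential port."""

    def __init__(self, words):
        self._bytes = bytearray(words * 4)
        self._rd_addr = 0  # auto-generated address for sequential reads
        self._wr_addr = 0  # auto-generated address for sequential writes

    # On-chip side: random access, one 32-bit word at a time.
    def read_word(self, index):
        return int.from_bytes(self._bytes[4 * index:4 * index + 4], "little")

    def write_word(self, index, value):
        self._bytes[4 * index:4 * index + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

    # Off-chip side: each enabled access advances its own address counter.
    def seq_read(self):
        value = self._bytes[self._rd_addr]
        self._rd_addr = (self._rd_addr + 1) % len(self._bytes)
        return value

    def seq_write(self, value):
        self._bytes[self._wr_addr] = value & 0xFF
        self._wr_addr = (self._wr_addr + 1) % len(self._bytes)

    def reset_address(self):
        """External redirect: clear both sequential address counters."""
        self._rd_addr = 0
        self._wr_addr = 0
```

With this arrangement the off-chip device needs only an 8-bit data path and enable/reset signals, with no address pins, which is consistent with the stated goal of minimizing the pin count.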

In this embodiment, the system thread flag in FIG. 1 is a W-bit register. Some of its bits are written under the control of the on-chip bus inside the system, while the others are written under the control of devices outside the system; all bits of the flag register can be read from both inside and outside the system. The register serves as a flag for interaction and synchronization between threads inside and outside the on-chip system. In this embodiment, W = 4, with three bits written under external control and one bit written under internal control.
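A minimal software model of such a flag register for W = 4 is given below for illustration. Which three bits are externally writable is not stated in the embodiment; the assignment of bits 0 to 2 to the external side and bit 3 to the internal side, like the names used, is an assumption.

```python
EXTERNAL_MASK = 0b0111  # assumed: bits 0-2 are written from outside the system
INTERNAL_MASK = 0b1000  # assumed: bit 3 is written from the on-chip bus


class ThreadFlags:
    """4-bit flag register with disjoint write ownership and shared read access."""

    def __init__(self):
        self.value = 0

    def write_external(self, data):
        # Only the externally owned bits change; the internal bit is preserved.
        self.value = (self.value & ~EXTERNAL_MASK) | (data & EXTERNAL_MASK)

    def write_internal(self, data):
        # Only the internally owned bit changes; the external bits are preserved.
        self.value = (self.value & ~INTERNAL_MASK) | (data & INTERNAL_MASK)

    def read(self):
        # Both sides may read all bits.
        return self.value
```

Because each side can write only its own bits, neither side can overwrite a flag set by the other, which is what makes the register usable for handshaking.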

In this embodiment, the on-chip bus in FIG. 1 maps the read/write control signals and logical addresses issued by the bus master (the RISC dual-core subsystem) to the select/enable signals and physical addresses required by each bus slave module (the image sensor control module, the processing array control module, the on-chip artificial neural network, the random/sequential mixed I/O memory and the thread flag), thereby driving these slave modules to carry out their operations. In this embodiment, the on-chip bus has a 32-bit data width and supports up to 16 slaves.
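The address mapping performed by the bus can be pictured with the sketch below. The embodiment gives only the data width and the maximum number of slaves; the 32-bit logical address, the use of its top four bits as the slave index, and the names used here are assumptions chosen purely for illustration.

```python
MAX_SLAVES = 16
OFFSET_BITS = 28  # assumed: low 28 bits form the physical address inside a slave


def decode(logical_addr):
    """Split a logical address into one-hot slave enables and a physical address."""
    slave = (logical_addr >> OFFSET_BITS) & (MAX_SLAVES - 1)
    offset = logical_addr & ((1 << OFFSET_BITS) - 1)
    enables = [i == slave for i in range(MAX_SLAVES)]
    return enables, offset
```

For example, under these assumptions `decode(0x30000010)` asserts only the enable of slave 3 and presents physical address `0x10` to it.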

In this embodiment, the whole high-speed on-chip vision system (the programmable vision chip) works as follows. Digital pixel data obtained by the image sensor is loaded into the PE array in a row-parallel manner. The PE array and the row processing unit array then cooperate to carry out the low-level and mid-level image processing flexibly and to extract image features, which are sent to the on-chip artificial neural network for feature recognition. In some cases the RISC dual-core subsystem performs further analysis. The small amount of final result data is then output from the system.

In addition, the RISC dual-core subsystem can dynamically adjust the data in the parameter registers of the image sensor control module according to the global image information or the region of interest obtained from processing in the PE array and the RP array. This allows the system to adapt automatically to a changing application environment and to meet the multi-resolution processing requirements that arise from relative motion between the system and the target.

The specific application of this embodiment is described in detail below by way of three typical high-speed visual image processing algorithms developed and run on the programmable-vision-chip-based visual image processing system of this embodiment.

(1) High-speed target tracking

FIG. 6 shows the flow of the high-speed target tracking algorithm on the visual image processing system of this embodiment. First, the image sensor array captures several frames, from which a background image is synthesized in the PE array according to a given rule; normal operation then begins. During normal operation, each captured frame is first smoothed in the PE array to remove noise, and the background image is subtracted to obtain a difference image. The RP array then gathers the approximate gray-level distribution of the difference image to determine the best dynamic threshold, and the PE array segments the difference image with this threshold to produce a binary image that marks the regions of the scene containing clearly moving objects. Next, each connected region of the binary image is separated in the PE array by binary morphological geodesic transformation, the shape features of each region are extracted by the RP array, and the RISC dual-core subsystem compares them one by one with the features of the target to be tracked. The region whose Euclidean or Manhattan distance to the target in feature space is smallest, and also below a predefined distance, is taken as the target; its region and center coordinates are locked and written to the I/O memory for output off-chip. Finally, the background in non-moving regions is updated according to a background model to eliminate interference from slow environmental changes. If the target collides with another moving object or moves behind an occluder, it may disappear for several frames; in that case the RISC dual-core subsystem automatically predicts and outputs the current region coordinates of the target from its earlier motion statistics, and the algorithm locks onto the target again as soon as it reappears. The algorithm is adaptable and robust, and can track targets in complex dynamic scenes containing several irregular, fast-moving objects. It can reach a processing speed of 1000 frames per second. Alternatively, in an artificially controlled environment with a simple background, the "self-window capture" method proposed by Miaowei et al. in patent ZL200510086902.2 can be applied, with the region of the tracked target specified manually when tracking starts. That algorithm can also reach a processing speed of 1000 frames per second.
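The feature comparison step that runs on the RISC dual-core subsystem can be sketched, for illustration only, as follows. The data layout (a list of feature-vector and center pairs), the function names, and in particular the constant-velocity extrapolation used when the target is lost are assumptions of this sketch; the embodiment states only that the position is predicted from earlier motion statistics.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def match_target(regions, target, max_dist, dist=manhattan):
    """Return the center of the closest region within max_dist, else None.

    regions is a list of (feature_vector, center) pairs, one per connected region.
    """
    best_center, best_dist = None, max_dist
    for features, center in regions:
        d = dist(features, target)
        if d < best_dist:
            best_center, best_dist = center, d
    return best_center


def next_position(history, regions, target, max_dist):
    """Use the matched center, or extrapolate from the last two known positions."""
    center = match_target(regions, target, max_dist)
    if center is None and len(history) >= 2:
        (x1, y1), (x2, y2) = history[-2], history[-1]
        center = (2 * x2 - x1, 2 * y2 - y1)  # assumed constant-velocity model
    return center
```

The Manhattan distance is used as the default because it needs no multiplication, which may suit a small RISC core; either distance can be passed to `match_target`.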

(2) High-speed gesture recognition

FIG. 7 shows the flow of the high-speed gesture recognition algorithm on the visual image processing system of this embodiment. The gesture recognition algorithm proposed by the present invention recognizes four classes of gestures and is mainly intended for a PPT gesture control system based on natural human-computer interaction; FIG. 8 lists the thresholded binary images of the four gesture classes and their corresponding control functions. In this algorithm, the five steps from background synthesis to threshold segmentation are the same as in the high-speed target tracking algorithm. A binary morphological region trimming algorithm is then applied in the PE array to remove small stray regions and fill small holes in large regions; the large, complete region that remains is the region containing the gesture to be recognized. Recognition is then performed by the artificial neural network, which must be fully trained beforehand. For training, the normalized density feature of the gesture region is extracted first: the region is divided evenly into several rows and several columns, and for each row and each column the ratio of the number of active pixels (pixels with value 1 in the binary image) to the total area of the region is computed. These ratios form a feature vector, which is used to train the neural network under supervision provided through the system thread flag (that is, the thread flag register is written externally to indicate which gesture class is currently being learned). Training can be carried out either on the RISC dual-core subsystem inside the system or on a general-purpose processor outside it. Recognition follows training. Two special cases among the gestures to be recognized are worth noting: the "blank" gesture, in which there is no region to recognize, and the mouse-movement gesture, in which only one finger is extended. To speed up recognition, the algorithm uses a cascade classifier that combines simple region features with the artificial neural network. The classifier first extracts simple features of the region (such as the total number of active pixels, shape parameters and vertex coordinates) and attempts to recognize the two special gestures on a RISC core; if that fails, it extracts the more complex, complete normalized density feature and passes it to the artificial neural network for recognition. Finally, it outputs the code of the recognized gesture class together with the gesture vertex coordinates (the vertex coordinates are used only for the mouse-movement gesture). Because the two special gestures account for most of the time in typical use, overall processing is greatly accelerated, and the average frame rate of the system can exceed 1000 frames per second. The high frame rate also makes it practical to apply software time-domain low-pass filtering to the recognition results on a RISC core, suppressing interference from environmental noise and gesture jitter.
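By way of example, the normalized density feature described above could be computed as in the following sketch. The representation of the region as a list of rows of 0/1 values, the way pixels are assigned to bands when the region size is not an exact multiple of the band count, and the function name are assumptions of this sketch.

```python
def density_features(region, n_rows, n_cols):
    """Return per-row-band and per-column-band active-pixel ratios.

    region is a rectangular list of lists of 0/1 pixels. Each ratio is the
    number of active pixels in the band divided by the total region area.
    """
    height = len(region)
    width = len(region[0])
    area = height * width
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    for y, row in enumerate(region):
        for x, pixel in enumerate(row):
            if pixel:
                row_counts[y * n_rows // height] += 1
                col_counts[x * n_cols // width] += 1
    return [count / area for count in row_counts + col_counts]
```

One choice that would fill the 16 input neurons of this embodiment exactly is 8 row bands and 8 column bands, although the embodiment does not state the numbers used.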

(3) Fast face detection

FIG. 9 shows the flow of the fast face detection algorithm on the visual image processing system of this embodiment; the algorithm can be used for counting people in particular settings. To apply it, a RISC core must control the image sensor so that it outputs one 64×64 image of the monitored area at a time. The algorithm uses the PPED feature vector for face detection, as described by Masakazu et al. in "An Image Representation Algorithm Compatible With Neural-Associative-Processor-Based Hardware Recognition Systems", IEEE Transactions on Neural Networks, 2003. Extraction of the PPED vector consists of two steps, both carried out on the PE array and the RP array: 5×5 template edge detection and edge-flag generation in four directions (horizontal, vertical, +45° and −45°), followed by combining and compressing the edge flags of the four directions according to a given rule into a 64-dimensional PPED vector. The artificial neural network then decides whether the image is a face; before it can do so, the network must be fully trained with templates from a standard face database. Because the feature dimension is high, the vector can be divided into feature subspaces that are trained and recognized one at a time; alternatively, where real-time requirements are strict and accuracy requirements are modest, the 64-dimensional PPED vector can be further compressed into a 16-dimensional vector to increase processing speed. On the system of this embodiment, using the full 64-dimensional PPED vector, face detection over the central 256×64 strip of each 256×256 frame, divided into ten 64×64 sub-regions (which must overlap to some extent to minimize missed detections), takes about 18 ms. In other words, the overall frame rate can exceed 50 frames per second, far higher than that of a serial processing system.
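The window layout and the frame-rate figure can be checked with the short calculation below. The embodiment says only that the ten 64×64 sub-regions overlap; the evenly spaced placement computed here, and the function name, are assumptions made for illustration.

```python
def window_offsets(strip_width=256, window=64, count=10):
    """Horizontal start positions of evenly spaced, overlapping windows."""
    span = strip_width - window  # distance between first and last start
    return [round(i * span / (count - 1)) for i in range(count)]


offsets = window_offsets()
# Spacing between neighboring windows; anything below 64 means they overlap.
steps = [b - a for a, b in zip(offsets, offsets[1:])]

# About 18 ms per frame corresponds to roughly 55 frames per second.
frames_per_second = 1000 / 18
```

Under this assumed layout, neighboring windows start 21 or 22 pixels apart, so each window shares about two thirds of its width with the next one.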

The specific embodiments described above further explain the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that they are merely specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A visual image processing system based on a programmable vision chip, characterized by comprising:

an image sensor for collecting raw image data at high speed, and transmitting the collected raw image data in parallel to a multi-stage parallel digital processing circuit; and

The multi-stage parallel digital processing circuit is used for performing fast parallel processing on the original image data received from the image sensor, and outputting the processing result.

2. The visual image processing system based on a programmable vision chip according to claim 1, characterized in that the image sensor comprises:

N×N pixel array (1), used for high-speed acquisition of original image data, and outputting the collected original image data to N×1 parallel analog preprocessing array (3), wherein N is a natural number;

N×1 parallel analog preprocessing array (3), used to remove fixed noise in the original image data, improve the dynamic range of the original image data, and output to N×1 parallel analog-to-digital conversion array (4);

N×1 row parallel analog-to-digital conversion array (4), used to convert each column of analog pixel data into high-precision digital pixel data, and output to the output pixel selection module (5);

The output pixel selection module (5) is used to receive N digital pixel data of the N×1 row parallel analog-to-digital conversion array (4) in parallel as input, and select M pixel data therefrom as the output of the image sensor to realize selection of pixel rows, where M is a natural number and M<N; and

An image sensor control module (6), used to control, according to its internal parameter registers, the operating timing of the N×N pixel array (1), the N×1 row parallel analog preprocessing array (3), the N×1 row parallel analog-to-digital conversion array (4) and the output pixel selection module (5), to realize dynamic control of the image sensor.

3. The visual image processing system based on a programmable vision chip according to claim 2, characterized in that:

The N×N pixel array (1) includes N×N two-dimensionally arranged pixel units (2), wherein each pixel unit (2) includes a photosensitive element and a corresponding readout circuit;

The N×1 row parallel analog preprocessing array (3) includes N one-dimensionally arranged analog preprocessing units, wherein each analog preprocessing unit includes a correlated double sampling (CDS) circuit for removing fixed noise and a controllable-gain amplifier (PGA) circuit for improving the dynamic range;

The N×1 parallel analog-to-digital conversion array (4) includes N one-dimensionally arranged analog-to-digital conversion units;

The output pixel selection module (5) cooperates with the image sensor control module (6) to select pixel rows and columns to realize flexible area processing and/or sub-sampling processing of the image sensor.

4. The visual image processing system based on a programmable vision chip according to claim 2, characterized in that the data in the parameter registers of the image sensor control module (6) can be read and written from outside the module through an on-chip bus interface, to realize dynamic control of the image sensor.

5. The visual image processing system based on a programmable vision chip according to claim 2, characterized in that the image sensor control module (6) controls the N×N pixel array (1) to perform rolling exposure, and each time selects one of the columns to output N analog pixel values in a row-parallel manner to the N×1 row parallel analog preprocessing array (3); after noise removal and dynamic range improvement in the N×1 row parallel analog preprocessing array (3), the values enter the N×1 row parallel analog-to-digital conversion array (4) and are converted in parallel into high-precision digital pixel data; finally, M digital pixel data are output through the output pixel selection module (5) as the final output of the image sensor and provided to the multi-stage parallel digital processing circuit.

6. The visual image processing system based on a programmable vision chip according to claim 1, characterized in that the multi-stage parallel digital processing circuit comprises:

An M×M pixel-level parallel processing unit array (7), used to perform, on the digital pixel data received from the image sensor, local linear processing suited to pixel-level parallelism, and to output the processing result to the M×1 row processing unit array (9), wherein M is a natural number and M<N;

M×1 row processing unit array (9), used to accelerate the nonlinear processing and wide-area processing suitable for row-parallel processing in low-level and mid-level images, so as to realize the extraction of image features;

A processing array control module (11), used to fetch, from its internal variable-length single instruction multiple data (SIMD) instruction memory, the control instructions for the M×M pixel-level parallel processing unit array (7) and the M×1 row processing unit array (9), and to decode them and output them to the M×M pixel-level parallel processing unit array (7) and the M×1 row processing unit array (9);

An on-chip configurable artificial neural network (12), used to complete feature recognition or feature compression tasks in high-level image processing, whose input is the feature vector data extracted by the M×1 row processing unit array (9) and whose output is the feature recognition result;

A reduced instruction processor (RISC) dual-core subsystem (13), used to implement thread-level parallel processing, to perform the irregular processing in high-level image processing other than regular feature recognition, and to control the entire system;

random/sequential mixed I/O memory (14);

system thread flag (15);

An on-chip bus (16), used to map the read/write control signals and logical address information from the RISC dual-core subsystem (13) to the select/enable signals and physical address information required by the other bus slave modules, to drive these slave modules to complete various operations.

7. The visual image processing system based on a programmable vision chip according to claim 6, characterized in that the M×M pixel-level parallel processing unit array (7) includes M×M two-dimensionally arranged pixel-level parallel processing units PE (8), and all pixel-level parallel processing units PE (8) work in single instruction multiple data (SIMD) mode, receiving the same PE array control instruction and performing the same operation, while the data operated on comes from the local memory of each unit.

8. The visual image processing system based on a programmable vision chip according to claim 7, characterized in that each pixel-level parallel processing unit PE (8) corresponds to one or more image pixels of the N×N pixel array (1);

When each pixel-level parallel processing unit PE (8) corresponds to one pixel, since M<N, the entire processing unit array corresponds to an M×M sub-region image of the N×N pixel array (1) or to an M×M sub-sampled image of the entire N×N pixel array (1); in this case the M×M pixel-level parallel processing unit array (7) processes one frame of the M×M sub-region image or sub-sampled image;

When each pixel-level parallel processing unit PE (8) corresponds to a plurality of pixels, the entire M×M pixel-level parallel processing unit array (7) corresponds to the entire N×N pixel array (1) or to a sub-region of the N×N pixel array (1) larger than M×M, and the whole frame is processed with partial pixel-level parallelism.

9. The visual image processing system based on a programmable vision chip according to claim 8, characterized in that the visual image processing system dynamically switches, through the image sensor control module (6), the correspondence between the pixel-level parallel processing units PE (8) and the image pixels, thereby realizing multi-resolution visual image processing.

10. The visual image processing system based on a programmable vision chip according to claim 7, characterized in that:

The pixel-level parallel processing unit PE (8) is used to complete basic 1-bit arithmetic and logic operations, namely addition, negation, AND and OR; multi-bit arithmetic and logic operations in low-level image processing are decomposed into these basic 1-bit operations and carried out on the pixel-level parallel processing unit PE (8);

The pixel-level parallel processing unit PE (8) can exchange data with its upper, lower, left and right neighboring processing units; through multiple transfers between neighboring processing units, each pixel-level parallel processing unit PE (8) can exchange data with a processing unit at any other position.

11. The visual image processing system based on a programmable vision chip according to claim 7, characterized in that the pixel-level parallel processing unit PE (8) comprises a first operand selector (31), a second operand selector (32), a 1-bit arithmetic logic unit (ALU) (33), a 1-bit temporary data register (34) and a bit-plane random access memory (35), wherein:

The first operand selector (31) selects one of its inputs as the first operand of the 1-bit ALU (33);

The second operand selector (32) selects, according to the control instruction output by the processing array control module (11), one of the output of the 1-bit temporary register (34) of this unit and the 1-bit immediate values 0 and 1 as the second operand of the 1-bit ALU (33).

12. The visual image processing system based on a programmable vision chip according to claim 11, characterized in that the 1-bit arithmetic logic operation unit (33) comprises: a full adder, a NOT gate, a two-input AND gate, a two-input OR gate, a carry register, and an output result selector; wherein,

The carry register is used to store the carry result generated by the addition operation, and the carry result is used for multi-bit arithmetic operations, and the carry register can be cleared by the control instruction output by the processing array control module (11);

The output result selector selects one from the output calculated by the full adder, the NOT gate, the AND gate and the OR gate according to the control instruction output by the processing array control module (11) as the output of the 1-bit arithmetic logic operation unit (33) result.

13. The visual image processing system based on a programmable vision chip according to claim 11, characterized in that the bit-plane random access memory (35) is a small-capacity random access memory with a data width of 1 bit that supports simultaneous reading and writing; its read and write addresses come from the control instruction output by the processing array control module (11), its write data comes from the output of the 1-bit arithmetic logic operation unit (33), and its read data serves as one of the inputs of the first operand selector of this unit or of an adjacent processing unit.

14. The visual image processing system based on a programmable vision chip according to claim 11, characterized in that the control instruction output by the processing array control module (11) selects whether the output result of the 1-bit arithmetic logic operation unit (33) is written into the bit-plane random access memory (35) or into the 1-bit temporary register (34), exactly one of which is written each time.

15. The visual image processing system based on a programmable vision chip according to claim 6, characterized in that the M×1 row processing unit array (9) comprises M one-dimensionally arranged row parallel processing units RP (10), and all row parallel processing units RP (10) work in single instruction multiple data (SIMD) mode, receiving the same RP array control instruction and performing the same operation, while the data operated on comes from the local registers of each unit;

Each row parallel processing unit RP (10) is used to complete k-bit arithmetic operations, including addition, subtraction, absolute value, data shift and magnitude comparison; operations on data wider than k bits can be decomposed into several operations of at most k bits and performed serially.

16. The visual image processing system based on a programmable vision chip according to claim 15, characterized in that each row parallel processing unit RP (10) corresponds to all the pixel-level parallel processing units PE (8) in the same row of the M×M pixel-level parallel processing unit array (7), and the data of each pixel-level parallel processing unit PE (8) in that row can enter the row parallel processing unit RP (10) one by one for further operation.

17. The visual image processing system based on a programmable vision chip according to claim 15, characterized in that each row parallel processing unit RP can exchange data with the row parallel processing units RP directly above and below it, and some row parallel processing units RP can additionally exchange data with the row parallel processing units RP located S rows above and below them; these row parallel processing units RP are called skip row processing units, and the remaining row parallel processing units RP are called ordinary row processing units; in the entire row processing unit array, a skip row processing unit is placed every S rows starting from the first row, and ordinary row processing units are placed in the remaining rows; wherein S is a natural number.

18. The visual image processing system based on a programmable vision chip according to claim 17, characterized in that the skip row processing units can exchange data directly over a long distance, without passing the data through all the intervening row parallel processing units RP (10) one by one, thereby realizing fast and flexible inter-row wide-area processing.

19. The visual image processing system based on a programmable visual chip according to claim 15, wherein the row parallel processing unit RP comprises:

A k-bit buffer shift register (41), used to realize serial-to-parallel and parallel-to-serial data conversion with the M×M pixel-level parallel processing unit array (7), and to serve as the data access interface through which the on-chip bus outside the array accesses the M×1 row processing unit array (9); it can also be updated with data read from the register file of the RP unit to which it belongs;

A k-bit first operand selector (42), used to select, according to the control instruction output by the processing array control module (11), one of the register file outputs of this unit or of adjacent row processing units and the buffer shift register output of this unit as the first operand of the k-bit arithmetic operation unit (44);

A k-bit second operand selector (43), used to select, according to the control instruction output by the processing array control module (11), one of the temporary register output of this unit and the immediate data from the array control instruction as the second operand of the k-bit arithmetic operation unit (44);

A k-bit arithmetic operation unit (44), used to perform wide-area processing and nonlinear processing, the wide-area processing including k-bit addition, subtraction, absolute value, data shift and magnitude comparison;

A condition selector (45), used to select, according to the control instruction output by the processing array control module (11), one of the 1-bit data output by the pixel-level parallel processing units PE (8) of the row where this unit is located, the condition flag register of the k-bit arithmetic operation unit (44), and the 1-bit constant 1 as the conditional operation enable signal, which enables the k-bit tri-state buffer gate (46);

A k-bit tri-state buffer gate (46), used to receive the output result of the k-bit arithmetic operation unit (44) and, under the control of the condition enable signal output by the condition selector (45), to determine whether the result of this operation is written into the k-bit temporary register (47) or the k-bit-wide register file (48), to realize conditional operation; and

A k-bit temporary register (47) and a k-bit wide register file (48).
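The datapath listed in claim 19 can be pictured as one instruction step: select two operands, compute, update the flags, and write back only if the condition is enabled. The following is a behavioural sketch only, assuming k = 8, a four-entry register file, and only the add and subtract operations; the class `RowUnit`, its field names and the instruction encoding are hypothetical and are not specified by the claims.

```python
from dataclasses import dataclass, field

K = 8                     # assumed word width k; the claims leave k open
MASK = (1 << K) - 1

@dataclass
class RowUnit:
    regfile: list = field(default_factory=lambda: [0] * 4)  # register file (48), depth assumed
    temp: int = 0         # temporary register (47)
    shift_reg: int = 0    # buffer shift register (41)
    carry: int = 0        # "carry/borrow" flag
    zero: int = 0         # "result is zero" flag

    def step(self, op, src1, src2, cond, dest, addr=0, imm=0, neighbor=0, pe_bit=1):
        # First operand selector (42): own register file, neighbouring RP, or shift register.
        a = {"reg": self.regfile[addr], "neighbor": neighbor, "shift": self.shift_reg}[src1]
        # Second operand selector (43): temporary register or immediate.
        b = self.temp if src2 == "temp" else imm & MASK
        # Condition selector (45): sampled before the flags change (an assumption).
        enable = {"pe": pe_bit, "carry": self.carry, "one": 1}[cond]
        # Arithmetic unit (44): only add and subtract are modelled here.
        full = a + b if op == "add" else a - b
        result = full & MASK
        self.carry = int(full > MASK or full < 0)
        self.zero = int(result == 0)
        # Tri-state gate (46): write back to exactly one destination when enabled.
        if enable:
            if dest == "temp":
                self.temp = result
            else:
                self.regfile[addr] = result
        return result
```

Passing `cond="pe"` with `pe_bit=0` leaves both destinations unchanged, which is how the sketch models a conditional operation gated directly by a PE output bit.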

20. The visual image processing system based on a programmable visual chip according to claim 19, characterized in that the k-bit buffer shift register (41) can shift left and right bit by bit under the array control instruction, to realize serial-to-parallel/parallel-to-serial data conversion with the M×M pixel-level parallel processing unit array (7); under the control of a signal from outside the array, it can also shift all of its bits up and down in parallel together with the buffer shift registers of the row parallel processing units RP (10) above and below it, so that the on-chip bus outside the array can access the data of the M×1 row processing unit array (9); the output of the k-bit buffer shift register (41) serves as one of the inputs of the k-bit first operand selector (42), and its value can also be updated by data read from the register file.

21. The visual image processing system based on a programmable visual chip according to claim 19, characterized in that, when the k-bit first operand selector (42) selects, according to the control instruction, among the register file outputs of this unit or the adjacent row processing units and the buffer shift register output of this unit, if this unit is a skip row processing unit, its selection range also includes the skip row processing units S rows away.

22. The visual image processing system based on a programmable visual chip according to claim 19, characterized in that the k-bit arithmetic operation unit (44) also updates its internal "carry/borrow" and "result is zero" flag registers according to each operation result, which facilitates conditional operations and operations on data wider than k bits; its flag registers can be cleared by the control instruction output by the processing array control module.

23. The visual image processing system based on a programmable visual chip according to claim 19, characterized in that the k-bit wide register file (48) is a small-capacity random access memory or register file with a data width of k bits that supports simultaneous reading and writing; its read and write addresses come from the control instruction output by the processing array control module (11), its write data comes from the output of the k-bit tri-state buffer gate (46), and its read data serves as one of the inputs of the k-bit first operand selector (42) of this unit or an adjacent row processing unit; if this unit is a skip row processing unit, the destinations also include the skip row processing units S rows away.

24. The visual image processing system based on a programmable visual chip according to claim 23, characterized in that the control instruction output by the processing array control module (11) is used to select whether each output result of the k-bit arithmetic operation unit (44) is written into the k-bit temporary register (47) or the k-bit wide register file (48); when the k-bit tri-state buffer gate (46) is enabled, exactly one of the two is written.

25. The visual image processing system based on a programmable visual chip according to claim 19, characterized in that the condition selector (45) can directly use the 1-bit data from the pixel-level parallel processing unit PE (8) as the condition enable signal, without serial-to-parallel conversion through the k-bit buffer shift register (41), which facilitates flexible and fast intra-row wide-area processing.

26. The visual image processing system based on the programmable visual chip according to claim 19, characterized in that,

When the M×1 row processing unit array (9) executes more complex algorithms and the storage space of the register file is insufficient, data can be stored in the M×M pixel-level parallel processing unit array (7) through the k-bit buffer shift register (41);

When all operations of the M×1 row processing unit array (9) are completed, the resulting data can be written into the k-bit buffer shift register (41), and then read out by the on-chip bus (16) outside the array.

27. The visual image processing system based on a programmable visual chip according to claim 6, characterized in that the location of the instruction segment that the processing array control module (11) reads from the variable-length SIMD instruction memory is dynamically configured through the on-chip bus (16), and when execution of that instruction segment is completed, a completion flag is generated and reported to the on-chip bus (16).

28. The visual image processing system based on a programmable visual chip according to claim 6, characterized in that, in order to support the coordinated operation of the M×M pixel-level parallel processing unit array (7) and the M×1 row processing unit array (9) while reducing the required on-chip instruction storage space, the visual image processing system adopts a variable-length SIMD instruction mechanism, in which a 2L-bit instruction word is stored at each address of the variable-length SIMD instruction memory; the instruction header distinguishes whether the word is one 2L-bit ultra-long SIMD instruction that controls the M×M pixel-level parallel processing unit array (7) and the M×1 row processing unit array (9) working cooperatively, or two L-bit ordinary SIMD instructions that control the M×M pixel-level parallel processing unit array (7) and the M×1 row processing unit array (9) working independently; the processing array control module (11) embeds a scheduling and decoding functional unit for variable-length SIMD instructions.
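The variable-length mechanism of claim 28 amounts to a header check on each 2L-bit memory word. The sketch below assumes L = 16 and a one-bit header in the most significant position of the word; both choices, and the function name `decode`, are hypothetical, since the claim does not fix the header format.

```python
L = 16  # assumed width of an ordinary SIMD instruction; the claim leaves L open

def decode(word):
    """Split one 2L-bit memory word into the instructions it carries.

    Assumption: the top bit of the word is the header; 1 marks a single
    2L-bit ultra-long instruction, 0 marks two packed L-bit instructions.
    """
    if not 0 <= word < (1 << (2 * L)):
        raise ValueError("word does not fit in 2L bits")
    header = word >> (2 * L - 1)
    if header == 1:
        return [("long", word)]            # drives both arrays cooperatively
    high = word >> L                       # first ordinary instruction
    low = word & ((1 << L) - 1)            # second ordinary instruction
    return [("short", high), ("short", low)]
```

Packing two ordinary instructions per address is what saves instruction memory: only words that genuinely need to drive both arrays together occupy a full 2L bits.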

29. The visual image processing system based on a programmable visual chip according to claim 6, wherein the configurable artificial neural network (12) on the chip comprises:

Input neuron vector register group (51), including T1 input neuron registers, wherein each input neuron register is used to store J1-bit fixed-point data, where T1<<M;

Neuron broadcaster (52), used to receive the data of the input neuron vector register group (51) and to select one input neuron at a time to broadcast to the parallel operation unit array (53), as one of the operands of each parallel operation unit in the parallel operation unit array (53);

Parallel operation unit array (53), comprising T2 parallel operation units, T2≤T1, each parallel operation unit receiving the input neuron broadcast by the neuron broadcaster (52) as the first operand, and simultaneously receiving the T2 weight/threshold data at each address of the weight/threshold memory (55) as the second operand, wherein each weight/threshold is J-bit fixed-point data, J>J1;

Output neuron vector register group (54), including T2 output neuron registers, wherein each output neuron register stores J2-bit fixed-point data;

Weight/threshold memory (55), which stores the weight and threshold data required by the operation process, with T2 J-bit fixed-point data at each address;

Neural network control module (56), used to control the parallel computing process of the entire on-chip configurable artificial neural network (12) according to the configured parameter information; during normal operation, the memory addresses of the on-chip configurable artificial neural network (12) are given by the neural network control module (56);

Bus read-write interface (57), used for external writing and reading of the data in the input neuron vector register group (51), the output neuron vector register group (54) and the weight/threshold memory (55) of the on-chip configurable artificial neural network (12); the mapping function of the piecewise linear mapping unit in each parallel operation unit and the control parameters of the neural network control module (56) are also flexibly configured through the bus read-write interface (57).

30. The visual image processing system based on a programmable visual chip according to claim 29, wherein each parallel operation unit includes a fixed-point multiplier, an accumulation register and a piecewise linear mapping unit, wherein the fixed-point multiplier and the accumulation register are used to complete the multiply-accumulate operation on the input neuron data and the corresponding weight/threshold, the accumulation register can be cleared by the neural network control module, and the piecewise linear mapping unit is used to realize the activation transfer function, whose output is used to update the output neuron vector register group (54).

31. The visual image processing system based on a programmable visual chip according to claim 29, characterized in that, under the control of the neural network control module (56), the neuron broadcaster (52) broadcasts one input neuron at a time to the parallel operation unit array (53), and the weight/threshold data corresponding to the broadcast input neuron are simultaneously fetched from the weight/threshold memory (55) into the parallel operation unit array (53); the products computed by the multipliers of the parallel operation units are accumulated into the accumulation registers; after all input neurons have been processed, the piecewise linear mapping is performed in parallel, and the final results are normalized into T2 bits and sent to the output neuron vector register group (54).
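The broadcast and accumulate flow of claims 30 and 31 can be sketched in a few lines. This is an illustrative model only: it uses plain Python integers rather than J-bit fixed-point arithmetic, it assumes the threshold is subtracted from the accumulated sum before the activation, and it uses a saturating ramp as one possible piecewise linear mapping. The names `clamp_ramp` and `run_layer` are hypothetical.

```python
def clamp_ramp(x, lo=0, hi=255):
    """A saturating ramp, used here as one example of a piecewise linear mapping."""
    return max(lo, min(hi, x))

def run_layer(inputs, weights, thresholds, activation=clamp_ramp):
    """Evaluate one layer by broadcasting input neurons one at a time.

    weights[i] holds the T2 weights fetched from one memory address for
    input neuron i. Assumption: thresholds are subtracted before activation.
    """
    t2 = len(thresholds)
    acc = [0] * t2                          # accumulation registers, cleared first
    for x, row in zip(inputs, weights):     # one broadcast per input neuron
        for j in range(t2):                 # all T2 units work in parallel in hardware
            acc[j] += x * row[j]
    return [activation(acc[j] - thresholds[j]) for j in range(t2)]
```

Feeding the returned list back in as `inputs` of a second `run_layer` call mirrors the way a multi-layer network is evaluated by reloading the input neuron registers from the output neuron registers.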

32. The visual image processing system based on a programmable visual chip according to claim 31, characterized in that the data written into the weight/threshold memory (55) and the data used to configure the parallel operation units and the neural network control module (56) are obtained from the training results of the neural network, the training process being carried out on the reduced instruction processor dual-core subsystem (13) or on a general-purpose processor outside the system.

33. The visual image processing system based on a programmable visual chip according to claim 29, wherein the on-chip configurable artificial neural network (12) supports at most T1 input neurons and at most T2 output neurons, with T2≤T1; when the number of input neurons is less than T1, or the number of output neurons is less than T2, the corresponding data in the remaining input neuron registers, output neuron registers and weight/threshold memory are automatically set to 0.

34. The visual image processing system based on a programmable visual chip according to claim 29, characterized in that the data of the output neuron registers are read out through the on-chip bus (16) and fed back into the input neuron registers, to realize the computation of a multi-layer neural network.

35. The visual image processing system based on a programmable visual chip according to claim 6, characterized in that the RISC dual-core subsystem (13) includes a No. 1 reduced instruction processor core (RISC#1), a No. 1 RISC private program/data memory, a No. 2 reduced instruction processor core (RISC#2), a No. 2 RISC private program/data memory, an inter-processor-core communication mailbox and a processor arbiter, in which:

The No. 1 core (RISC#1) and the No. 2 core (RISC#2) of the RISC dual-core subsystem (13) each have a private program/data memory with a data width of P bits, to achieve thread-level parallel processing; they are responsible for the irregular processing in high-level image processing other than regular feature recognition, and for the control of the entire system.

36. The visual image processing system based on the programmable visual chip according to claim 35, characterized in that,

The No. 1 RISC core (RISC#1) and the No. 2 RISC core (RISC#2) communicate through the inter-processor-core communication mailbox to realize the necessary thread synchronization and data exchange;

The access rights of the No. 1 RISC core (RISC#1) and the No. 2 RISC core (RISC#2) to the on-chip bus are controlled by the processor arbiter, which supports two arbitration methods in hardware: fixed priority and first-come-first-served;

The inter-processor core communication mailbox is a synchronous bidirectional FIFO.
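The two arbitration methods named in claim 36 can be contrasted with a minimal sketch. It assumes that a lower core number means higher fixed priority and that ties in arrival time are broken by core number; neither rule is stated in the claims, and the function name `arbitrate` is hypothetical.

```python
def arbitrate(requests, mode):
    """Pick which core is granted the on-chip bus.

    requests: list of (arrival_cycle, core_id) pairs for pending requests.
    mode: "fixed" for fixed priority, "fcfs" for first-come-first-served.
    Assumptions: lower core_id wins under fixed priority and on FCFS ties.
    """
    if not requests:
        return None
    if mode == "fixed":
        return min(core for _, core in requests)
    if mode == "fcfs":
        return min(requests)[1]     # earliest arrival, then lowest core_id
    raise ValueError(f"unknown arbitration mode: {mode}")
```

With pending requests `[(5, 1), (3, 2)]`, fixed priority grants core 1 while first-come-first-served grants core 2, which arrived earlier.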

37. The visual image processing system based on a programmable visual chip according to claim 35, characterized in that the RISC dual-core subsystem (13) also dynamically adjusts the data in the parameter registers of the image sensor control module (6) according to the macroscopic image information or the target range of interest obtained through processing by the M×M pixel-level parallel processing unit array (7) and the M×1 row processing unit array (9), in order to adapt to continuously changing application environments and to satisfy the multi-resolution processing requirements brought about by relative motion between the system and targets in the environment.

38. The visual image processing system based on a programmable visual chip according to claim 6, characterized in that the random/sequential mixed I/O memory (14) is a dual-port memory, in which one port is P bits wide and is accessed randomly by the on-chip bus, and the other port is PS (PS<P) bits wide and is accessed sequentially by off-chip devices, with reading and writing independent of each other; the enable signals for off-chip sequential reading and writing are automatically mapped to physical addresses of the memory by the address generation module embedded in the memory; the physical address can be cleared and redirected externally.

39. The visual image processing system based on a programmable visual chip according to claim 6, characterized in that the system thread flag (15) is a W-bit register, in which some bits are written under the control of the on-chip bus (16) inside the system, while the other bits are written under the control of devices external to the system; all bits of the flag register can be read both inside and outside the system.

40. The visual image processing system based on a programmable visual chip according to claim 6, characterized in that the on-chip bus (16) maps the read/write control signals and logical address information from the RISC dual-core subsystem (13) to the strobe enable signals and physical address information required by the other bus slave device modules, the device modules including the image sensor control module, the processing array control module, the on-chip artificial neural network, the random/sequential mixed I/O memory, and the system thread flag.
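The mapping in claim 40 from a logical address to a slave strobe plus a physical address is ordinary address decoding. The sketch below uses an entirely hypothetical address map: the base addresses, window sizes and module names in `ADDRESS_MAP` are invented for illustration and do not come from the patent.

```python
# Hypothetical address map: (slave name, base address, window size).
# None of these numbers come from the claims; they only illustrate the idea.
ADDRESS_MAP = [
    ("image_sensor_control", 0x0000, 0x0100),
    ("processing_array_control", 0x0100, 0x0100),
    ("neural_network", 0x1000, 0x1000),
    ("io_memory", 0x2000, 0x2000),
    ("thread_flag", 0x4000, 0x0004),
]

def decode_address(logical):
    """Return (selected slave, physical address inside it), or None if unmapped."""
    for name, base, size in ADDRESS_MAP:
        if base <= logical < base + size:
            return name, logical - base    # strobe target and local offset
    return None
```

In this model the returned slave name plays the role of the strobe enable signal, and the offset plays the role of the physical address seen by that slave.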

41. The visual image processing system based on a programmable visual chip according to claim 6, characterized in that, in the visual image processing system, the digital pixel data obtained by the image sensor is loaded into the M×M pixel-level parallel processing unit array (7), various low-level and mid-level image processing operations are performed to extract image features, and the features are sent to the on-chip artificial neural network (12) for feature recognition and then further analyzed and processed by the reduced instruction processor dual-core subsystem (13), to obtain and output a small amount of final result data.