Design of an FPGA-based fast image processing system: Chinese and English translation material
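Design of an FPGA-based fast image processing system

Abstract
We evaluate and improve the performance of a hardware/software architecture intended to handle a wide range of image processing tasks. The system architecture is based on a field programmable gate array (FPGA) and a host computer. A LabVIEW application on the PC controls image acquisition and video capture from an industrial camera. Data are transferred over USB2.0. The FPGA controller is based on an Altera Cyclone II chip and is designed as a system-on-a-programmable-chip (SOPC) with an embedded Nios II core. The SOPC integrates the CPU, on-chip and external memory, the communication channel and the image data processing system. A standard transfer protocol is used, and the hardware/software logic accommodates various frame sizes. The system is compared with other solutions and a range of applications is discussed.
Keywords: software/hardware co-design; image processing; FPGA; embedded systems

1. Introduction
Traditional hardware implementations of image processing generally use DSPs or application-specific integrated circuits (ASICs). However, the pursuit of higher speed and lower cost has shifted solutions toward FPGAs, whose parallel processing yields better performance. When an application requires real-time processing, as in video or television signal processing or the control of a mechanical manipulator, the requirements are very strict and are better met by an FPGA. Computationally demanding functions such as filtering, motion estimation, two-dimensional discrete cosine transforms (2D DCTs) and fast Fourier transforms (FFTs) are also better optimized on FPGAs. With more hardware multipliers, larger memory capacity and a higher degree of system integration, FPGAs can readily outperform conventional DSPs.
Combining a computer-based imaging application with an FPGA-based parallel controller requires a software/hardware interface capable of high-speed transfer. The system presented here is a typical mixed hardware/software design. On one side, a host computer equipped with a camera and a frame grabber runs a LabVIEW imaging application; on the other side, an Altera FPGA development board runs the image filters and other system components. Image data are transferred at high speed over USB2.0. The hardware components and control logic on the FPGA board are interconnected by the embedded Nios II processor, with USB2.0 as the communication channel.

2. Design tools overview
Designing a DSP system on an FPGA usually involves both high-level algorithm development tools, such as MATLAB, and hardware description languages. Third-party intellectual property (IP) cores can also be used to implement typical DSP functions or high-speed communication protocols. In our application we use model-based design tools such as MathWorks Simulink to build the DSP blocks. The generated HDL code is then synthesized together with the other hardware design files in Quartus II. SOPC-Builder is a tool within the Quartus environment that integrates the Nios II with external hardware logic and standard peripherals. It provides an interface fabric that interconnects the Nios II with external memory, the filters and the host computer.

3. Modeling and implementation of the filter design
The main goal of this work is to evaluate the image processing performance of a host/co-processor architecture, including the performance of the embedded Nios II and of the USB2.0 link between the host computer and the FPGA board. The capabilities of existing FPGA devices may limit the image processing that can be performed. For this purpose we built a typical image processing application targeting the FPGA co-processor. It consists of a noise filter and an edge detector. Noise reduction and edge detection are two elementary processes used throughout machine vision, for example in object recognition, medical imaging, lane detection in next-generation vehicles, people tracking and control systems.
We model the noise and edge filters with the Altera DSP-Builder libraries in Simulink; an example of this procedure can be found in [11]. Noise reduction uses a 3 × 3 Gaussian kernel, and edge detection uses typical Prewitt or Sobel filters. The two functions can be applied in series so that edge detection follows noise reduction. Fig. 1 shows the block diagram of the filter design.
Fig. 1. Block diagram of the filter design
Besides the noise and edge filter blocks, there is a block of intermediate logic that coordinates the Nios II data and control paths with the filter task logic. This intermediate hardware fabric follows the Avalon interface [12]. The interface cannot be modeled in Simulink and is instead inserted into the system as a Verilog file. Our Avalon implementation consists of a 16-bit data input and output path, the corresponding read and write control signals, and a control interface that selects between the intermediate output of the Gaussian filter and the output of the edge detector. Input and output data are buffered in FIFO registers.
Each received image frame is stored in an external SDRAM buffer and converted into a 16-bit data stream suitable for the Nios II. The Nios II code is discussed in Sections 5 and 6.
Incoming pixels pass through a simple 2D digital finite impulse response (FIR) convolution filter that operates on the grayscale intensities of neighboring pixels in a 3 × 3 region. The line-buffering principle is shown in Fig. 2.
Fig. 2.
We assume an image size of 640 × 480 pixels. The same buffer circuit is used for both filters. If the frame size changes, the design must be modified and recompiled. The number of delay blocks depends on the kernel size, and the delay depth depends on the number of pixels per line. The delay lines are implemented in dedicated RAM blocks and therefore do not consume FPGA logic elements. Fig. 3 shows, from left to right, the original image, the Gaussian-filtered image and the edge-filtered image.
Fig. 3.

4. Embedded system design
The co-processor described above is implemented as a component of the Nios II processor system. The role of the Nios II processor here is to handle the data streams. Designs of this kind are common in industrial and academic projects. Once the synthesis software is installed, the Nios II is available as a component in Quartus. DSP-Builder converts the designed model into HDL code that can be combined with other hardware components. With the synthesis software, the filter can easily be integrated into the SOPC alongside the Nios II.
The Nios II soft core and the other modules form a complete system that includes external memory controllers, DMA channels and a custom IP core for high-speed USB communication. A VGA controller can output the final result to a screen. Such functions can be implemented with open-source IP cores or with evaluation versions of IP cores from third-party vendors.
The USB2.0 high-speed interface is added to the FPGA main board through an expansion board. As a system-level solution, the daughter board can be plugged into any Altera main board that has a Santa Cruz peripheral connector. The daughter board provides a USB2.0 transceiver based on the CY7C68000 PHY. A UTMI-compliant core integrates the USB controller functions into the Nios II system. Section 8 evaluates the actual performance of the IP core. Fig. 4 shows the FPGA design flow. Fig. 5 shows the FPGA development board and the image acquisition hardware.
Fig. 4. FPGA design flow
Fig. 5.

5. Nios soft-core design
After the Nios is configured, its code is downloaded. Writing the Nios code in C serves two purposes:
(a) It controls hardware operations such as DMA transfers between hardware units. It also provides a programming interface to the data channels, controlled through API calls such as "open", "read", "write" and "close".
(b) It allows simple processing of the incoming data to be done in software rather than in dedicated hardware. For example, Nios code can convert an image array into a suitable one-dimensional data stream.

6. Activity flow
In terms of software and hardware activity, the operation of the mixed architecture can be summarized as follows:
(a) The image stream travels from the host computer to the FPGA board over the USB2.0 high-speed serial bus. The application programming interface used to move data in and out over USB is described in the next section.
(b) On-board DMA transfers move the data from memory to the Nios data path, from where they pass through the Avalon fabric to the hardware logic.
(c) The hardware accelerator processes the data stream.
(d) After filtering, the hardware logic returns the image data to memory through DMA.
(e) The final result is output to the VGA digital-to-analog channel. As a peripheral of the Nios processor, the VGA controller supports DMA transfers. However, the digital-to-analog chip of the VGA interface does not convert all the data in real time. A possible alternative is therefore to return the data over USB to the host computer for further processing as ordinary image data.
It should be noted that this design is not meant as a black box for one specific application; it represents a design methodology that can be customized for a wide range of applications.

7. Interface design and application
The PC-based application software and vision components are suitable for a variety of industrial applications. The host system comprises the Windows XP operating system, a Pentium 4 processor, a USB2.0 high-speed serial bus controller and an NI 1408 PCI frame grabber. The host application is a LabVIEW virtual instrument that controls image acquisition and performs preliminary image processing. Fig. 6 shows the LabVIEW control interface on the PC.
Fig. 6. LabVIEW control interface
The frame grabber supports up to five industrial cameras for different tasks. In our system a CCD camera captures full frames of 640 × 480 monochrome pixels, but the frames finally acquired are 320 × 240. The smaller data volume makes continuous transfer easier. Communication between the main LabVIEW program and the USB interface uses API functions and a dynamic link library. An advantage of LabVIEW is its integrated image processing platform, which allows fast processing or pre-processing of image data. Once the FPGA board has received a complete image array, the system passes the image to the filter and, after filtering, sends the data to the buffer of the VGA controller.

8. System performance evaluation
With the image capture setup described above, we sent test data to measure USB receive and transmit performance between the PC and the FPGA board. In these tests the payload transferred between host and target board was 307,200 bytes. With version 1.2 of the Nios HAL driver, the receive rate reached 65 Mbit/s and the transmit rate reached 80 Mbit/s. For full-speed transfer, the reported figure is 9 seconds.

9. Comparison with other systems
We now compare our system with other image processing solutions in terms of performance and flexibility. To obtain comparison data, we built the alternative solutions and ran a series of experiments, designing different filters to vary the computational complexity. The results, compared against a computer with a Pentium 4 processor and 512 MB of memory, are shown in Fig. 7.
Fig. 7.

10. Conclusion
This paper presented a design that combines a host computer with an FPGA and examined the image processing performance of the resulting system. It also represents a design methodology that can be used for a wide range of custom applications. It is based on a programmable FPGA device and is implemented around an embedded Nios processor.

Design and evaluation of a hardware/software FPGA-based system for fast image processing
J.A. Kalomiros a,*, J.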
Lygouras b

Abstract
We evaluate the performance of a hardware/software architecture designed to perform a wide range of fast image processing tasks. The system architecture is based on hardware featuring a Field Programmable Gate Array (FPGA) co-processor and a host computer. A LabVIEW host application controlling a frame grabber and an industrial camera is used to capture and exchange video data with the hardware co-processor via a high-speed USB2.0 channel, implemented with a standard macrocell. The FPGA accelerator is based on an Altera Cyclone II chip and is designed as a system-on-a-programmable-chip (SOPC) with the help of an embedded Nios II software processor. The SOPC system integrates the CPU, external and on-chip memory, the communication channel and typical image filters appropriate for the evaluation of the system performance. Measured transfer rates over the communication channel and processing times for the implemented hardware/software logic are presented for various frame sizes. A comparison with other solutions is given and a range of applications is also discussed.
Keywords: Hardware/software co-design; Image processing; FPGA; Embedded processor

1. Introduction
The traditional hardware implementation of image processing uses Digital Signal Processors (DSPs) or Application Specific Integrated Circuits (ASICs). However, the growing need for faster and more cost-effective systems triggers a shift to Field Programmable Gate Arrays (FPGAs), where the inherent parallelism results in better performance [1,2]. When an application requires real-time processing, like video or television signal processing or real-time trajectory generation of a robotic manipulator, the specifications are very strict and are better met when implemented in hardware [3-5]. Computationally demanding functions like convolution filters, motion estimators, two-dimensional Discrete Cosine Transforms (2D DCTs) and Fast Fourier Transforms (FFTs) are better optimized when targeted on FPGAs [6,7].
Features like embedded hardware multipliers, an increased number of memory blocks and system-on-a-chip integration enable video applications in FPGAs that can outperform conventional DSP designs [2,8].
On the other hand, solutions to a number of imaging problems are more flexible when implemented in software rather than in hardware, especially when they are not computationally demanding or when they need to be executed only sporadically in the overall process. Moreover, some hardware components are hard to re-design and transfer to an FPGA board from scratch when they are already a functional part of a computer-based system. Such components are frame grabbers and multiple-camera systems already installed as part of an imaging application or other robotic control equipment.
Following the above considerations, we conclude that it is often necessary to integrate components from an already installed computer-based imaging application, dedicated to some automation system, with FPGA-based accelerators that exploit the low-level parallelism inherent in hardware structures. Thus a critical need arises for an embedded software/hardware interface that allows high-bandwidth communication between the host application and the hardware accelerators.
In this paper we apply and evaluate the performance of an example mixed hardware/software design that includes on the one side a host computer running a National Instruments (NI) LabVIEW imaging application, equipped with a camera and a frame grabber, and on the other side an Altera FPGA board [9] running an image filter hardware accelerator and other system components. The communication channel transferring image data from the host computer to the hardware board is a high-speed USB2.0 port implemented by means of an embedded macrocell.
The various hardware parts and peripherals on the FPGA board are controlled and interconnected by a Nios-II embedded soft processor.
As a result of this evaluation, one can explore the range of applications suitable for a host/co-processor architecture that includes an embedded Nios-II processor and uses a USB2.0 communication channel.
In the following, we first give a short account of the tools we used for system design. We also present an overview of the particular image filtering application we embedded in the FPGA chip for the evaluation of the host/co-processor system architecture. We describe the modular interconnection of the different system parts and assess the performance of the system. We examine the speed and frame-size limits of such a design when it is dedicated to image processing. Finally, we compare our mixed host/co-processor USB-based design with other architectures and other communication media.

2. Design tools overview
The design of a DSP system with FPGAs often utilizes both high-level algorithm development tools and hardware description language (HDL) tools. It can also make use of third-party intellectual property (IP) cores implementing typical DSP functions or high-speed communication protocols [1].
In our application we use model-based design tools like MathWorks Simulink (based on MathWorks MATLAB) with the libraries of Altera's DSP-Builder. DSP-Builder uses model design to produce and synthesize HDL code, which can then be integrated with other hardware design files within a synthesis tool, like the Quartus II development environment. In the present work, we designed image filter components using DSP-Builder libraries and the resulting blocks were integrated with the rest of the system in the Quartus System-On-a-Programmable-Chip (SOPC) Builder.
The SOPC-Builder design software resides as a tool in the Quartus environment. Its purpose is to integrate an embedded software processor like Altera's Nios-II with hardware
logic and custom or standard peripherals within an overall system, often called a System-On-a-Programmable-Chip (SOPC). SOPC-Builder provides an interface fabric in order to interconnect the Nios-II processing path with embedded and external memory, the filter co-processors, other peripherals and the channels of communication with the host computer.
Nios-II applications were written in ANSI C and were compiled and downloaded to the FPGA board by means of Altera's Nios II Integrated Development Environment (IDE), a tool dedicated to assembling code for Nios processors. The purpose of the Nios-II applications is to control processing and data streaming between the components of the system and its peripherals.
On the host side one may develop a control application in any suitable language, such as C. We use LabVIEW software by National Instruments Corporation [10], which provides a very flexible platform for image acquisition, image processing and industrial control.

3. Modeling and implementation of the filter design
The main target of this work is to evaluate the performance of a host/co-processor architecture that includes an embedded Nios-II processor and uses a communication channel between host and hardware board, like a USB2.0 channel. The task logic performed by the embedded accelerator can be any image function within the limitations of existing FPGA devices.
For our purpose we built a typical image-processing application in order to target the FPGA co-processor. It consists of a noise filter followed by an edge detector. Noise reduction and edge detection are two elementary processes required for most machine vision applications, like object recognition, medical imaging, lane detection in next-generation automotive technology, people tracking, control systems, etc.
We model noise and edge filtering using the Altera DSP-Builder libraries in Simulink. An example of this procedure can be found in [11].
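The filter chain described above, a noise filter followed by an edge detector, can be sketched as a plain software reference model. This is only an illustration, not the authors' DSP-Builder design: the binomial 3 × 3 smoothing coefficients, the Sobel operators, the |gx| + |gy| magnitude, the zeroed border pixels and all function names are assumptions made for the example.

```python
# Software reference model of a "noise filter -> edge detector" chain.
# All coefficients and names below are assumptions for illustration.

GAUSS_3X3 = [[1, 2, 1],
             [2, 4, 2],
             [1, 2, 1]]      # binomial approximation of a Gaussian, sum = 16
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]       # responds to horizontal intensity changes
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]        # responds to vertical intensity changes


def apply_3x3(image, kernel):
    """Weighted sum over each pixel's 3x3 neighborhood.

    Border pixels are left at 0. The kernel is not flipped, which changes
    only the sign of the Sobel response, not its magnitude.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            out[r][c] = acc
    return out


def gaussian_smooth(image):
    """Noise reduction: 3x3 smoothing, normalized by the kernel sum (16)."""
    return [[value // 16 for value in row]
            for row in apply_3x3(image, GAUSS_3X3)]


def sobel_edges(image):
    """Edge strength as |gx| + |gy|, clipped to the 8-bit range."""
    gx = apply_3x3(image, SOBEL_X)
    gy = apply_3x3(image, SOBEL_Y)
    return [[min(255, abs(x) + abs(y)) for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


def filter_chain(image):
    """Edge detection applied after noise reduction."""
    return sobel_edges(gaussian_smooth(image))
```

Because the smoothing step zeroes the border, pixels next to the border show an artificial edge after the Sobel step; a model like this is therefore best compared with hardware output on interior pixels only.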
Noise reduction is applied with a Gaussian 3 × 3 kernel, while edge detection is designed using typical Prewitt or Sobel filters. These functions can be combined in series to achieve edge detection after noise reduction. The main block diagram of our filter accelerator is shown in Fig. 1. Apart from the noise and edge filter blocks, there is also a block representing the intermediate logic between the Nios-II data and control paths and our filter task logic. Such intermediate hardware fabric follows a specific protocol referred to as the Avalon interface [12]. This interface cannot be modeled in the Simulink environment and is instead inserted in the system as a Verilog file. Design examples implementing the Avalon protocol can be found in Altera reference designs and technical reports [13]. In brief, our Avalon implementation consists of a 16-bit data input and output path, the appropriate Read and Write control signals, and a control interface that allows selection between the intermediate output from the Gauss filter and the output from the edge detector. Data input to and output from the task logic blocks is implemented with the help of Read and Write instances of a 4800-byte FIFO register.
Each image frame, when received by the hardware board, is loaded into an external SDRAM memory buffer and is converted into an appropriate 16-bit data stream by means of Nios-II instruction code. Data transfer between external memory buffers and the Nios-II data bus is achieved through Direct Memory Access (DMA) operations controlled by appropriate instruction code for the Nios-II soft processor. Nios-II code flow for this system is discussed in Sections 5 and 6.
Incoming pixels are processed by means of a simple 2D digital Finite Impulse Response (FIR) filter convolution kernel, working on the grayscale intensities of each pixel's neighbors in a 3 × 3 region. Image lines are buffered through delay lines producing primitive 3 × 3 cells where the filter kernel applies.
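The idea of buffering image lines through delay lines to form 3 × 3 cells can be mimicked in software. The sketch below is a behavioral model only, not the Simulink design from the paper: it assumes a raster-scan pixel stream, two delay lines whose depth equals the line width, and a binomial 3 × 3 kernel normalized by a right shift of 4. The name `stream_filter` and its parameters are hypothetical.

```python
from collections import deque

# Assumed smoothing coefficients (sum = 16, so a right shift by 4 normalizes).
GAUSS_3X3 = ((1, 2, 1), (2, 4, 2), (1, 2, 1))


def stream_filter(pixels, width, kernel=GAUSS_3X3, shift=4):
    """Yield filtered values for a raster-scan stream of pixels.

    Two delay lines of depth `width` hold the previous two scan lines, so
    every incoming pixel completes one new column of a 3x3 window. A result
    is produced only when the window lies fully inside the frame, and it
    belongs to the pixel one row and one column before the current one.
    """
    line1 = deque([0] * width, maxlen=width)   # delays the stream by one line
    line2 = deque([0] * width, maxlen=width)   # delays it by two lines
    window = [[0, 0, 0] for _ in range(3)]     # rows ordered oldest line first

    for index, pixel in enumerate(pixels):
        above = line1[0]        # same column, previous scan line
        above2 = line2[0]       # same column, two scan lines back
        line2.append(above)     # maxlen drops the oldest sample automatically
        line1.append(pixel)

        # Shift the window one column to the left and insert the new column.
        for row, value in zip(window, (above2, above, pixel)):
            row.pop(0)
            row.append(value)

        r, c = divmod(index, width)
        if r >= 2 and c >= 2:
            acc = sum(kernel[i][j] * window[i][j]
                      for i in range(3) for j in range(3))
            yield acc >> shift
```

For a frame of `height` lines the generator yields `(height - 2) * (width - 2)` values. Changing `width` changes the delay-line depth, which is the software counterpart of a line-buffer depth fixed by the frame resolution.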
The line-buffering principle is shown in Fig. 2. A z^-1 delay block produces a neighboring pixel in the same scan line, while a z^-640 delay block produces the neighboring pixel in the previous image scan line.
We assume an image size of 640 × 480 pixels. The line-buffer circuit is implemented in the same manner for both noise and edge filters. Frame resolution is incorporated in the line-buffer diagram as a hardware built-in parameter. If a change in frame size is required, we need to re-design and re-compile. The number of delay blocks depends on the size of the convolution kernel, while the delay-line depth depends on the number of pixels in each line. Each incoming pixel is at the center of the mask and the line buffers produce the neighboring pixels in adjacent rows and columns.
Delay lines with considerable depth are implemented as dedicated RAM blocks in the FPGA chip and do not consume logic elements.
After line buffering, pipelined adders and embedded multipliers calculate the convolution result for each central pixel. Fig. 3 shows the model design for implementation of the 3 × 3 Gauss kernel calculations. As is shown in Fig. 3, model-based design transfers the necessary arithmetic into a parallel digital