Edge-based embedded vision systems often use machine learning techniques to embed intelligence within the solution. Their use cases demand solutions that are not only capable of high performance but also power efficient, flexible and deterministic.
To address these challenges, embedded vision developers utilise the Xilinx® All Programmable Zynq® SoC or Zynq® UltraScale+™ MPSoC devices and develop applications using the reVISION™ acceleration stack and SDSoC™ system optimising compiler. These devices provide programmable logic coupled with high-performance Arm® Cortex-A53 or Cortex-A9 processors, forming a tightly integrated heterogeneous processing system, while reVISION and SDSoC provide the ability to work with and accelerate industry-standard frameworks and libraries such as OpenCV, OpenVX and Caffe. Accelerating these frameworks allows developers to leverage the inherently parallel nature of programmable logic to create image processing pipelines and machine learning inference engines. This leaves the processing system free to implement higher-level decision making, scheduling, communication and system-management functions.
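The kind of inference stages such a pipeline accelerates can be sketched in plain NumPy. This is an illustrative model only; reVISION implements these stages in programmable logic, and all function names below are hypothetical:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' cross-correlation, the convolution used in CNNs
    (illustrative only, not the accelerated implementation)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: zero out negative activations."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Toy 6x6 input through one Conv -> ReLU -> Max Pooling stage
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])  # simple diagonal-difference kernel
features = max_pool(relu(conv2d(image, kernel)))
print(features.shape)  # (2, 2)
```

In a reVISION design, loops like the one in `conv2d` are exactly what the compiler unrolls and pipelines across the programmable logic fabric.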
For many edge processing applications, a low-latency decision and response loop is also of critical importance. One example would be a vision-guided autonomous robot, where the response time is critical to avoid injury to people or damage to its environment.
To achieve this, reVISION provides both hardware-optimised OpenCV functions and machine learning inference stages such as Conv, ReLU, Max Pooling and Fully Connected. To provide an optimal solution within the programmable logic, machine learning applications increasingly use more efficient reduced-precision fixed-point number systems, such as INT8. Fixed-point number systems come without significant loss in accuracy compared with a traditional 32-bit floating-point (FP32) approach, and bring power and performance benefits when used in a Xilinx device. To support the generation of a reduced-precision number system, the automated Ristretto tool for Caffe enables neural networks to be trained using fixed-point number systems. This removes the need, and the associated potential performance impact, of converting floating-point training data into fixed point.
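As a rough illustration of reduced precision, the sketch below quantises FP32 weights to symmetric INT8 with a single per-tensor scale factor. This is a simplification of what a tool like Ristretto does, and the helper names are hypothetical:

```python
import numpy as np

def quantize_int8(weights):
    """Map FP32 weights to symmetric INT8 using one scale per tensor
    (a simplified sketch of fixed-point quantisation, not Ristretto itself)."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, 1000).astype(np.float32)
q, scale = quantize_int8(w)

# Round-to-nearest keeps the worst-case error within half a quantisation step,
# which is why accuracy loss versus FP32 is typically small
err = float(np.max(np.abs(dequantize(q, scale) - w)))
print(err <= scale / 2 * 1.001)  # True
```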
Using INT8 allows operations to use the DSP48E2 slices available within the UltraScale+ architecture. These dedicated multiply-accumulate elements are designed for fixed-point arithmetic and provide a significant performance increase. Their structure also enables a resource-efficient implementation, as each slice can perform up to two INT8 MACC operations when the operations share the same kernel weight. This approach can provide up to a 1.75x throughput improvement while enabling a cost-optimised solution with two to six times the power efficiency (GOPS per Watt) of competing devices.
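The two-MACCs-per-slice trick relies on packing two INT8 activations into one wide multiplier input so that a single multiply yields both products. The sketch below models the arithmetic only (not the DSP48E2 hardware itself), assuming a 27x18-style multiplier and a shared INT8 weight:

```python
def dual_int8_mult(a1, a2, w):
    """Compute a1*w and a2*w with ONE wide multiplication, the way a
    27x18 multiplier can: pack a1 and a2 eighteen bits apart, multiply
    by the shared weight w, then split the result and correct the signs."""
    packed = (a1 << 18) + a2        # wide operand holding both activations
    product = packed * w            # the single multiplication
    low = product & 0x3FFFF         # lower 18 bits hold a2*w ...
    if low >= 1 << 17:              # ... reinterpreted as a signed 18-bit value
        low -= 1 << 18
    high = (product - low) >> 18    # the remaining bits hold a1*w
    return high, low

# Both products fall out of one multiply, for all sign combinations
print(dual_int8_mult(100, -100, -27))  # (-2700, 2700)
```

Because a2*w is at most 16 bits wide, it always fits in the lower 18-bit field without corrupting a1*w above it, which is what makes the sign correction sufficient.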
One common machine learning use case is a collision detection and automatic braking system. Comparing a GoogLeNet convolutional neural network running on an Nvidia TX1 with 256 cores at a batch size of 8 against a Xilinx ZU9 running at a batch size of 1, developed with reVISION, demonstrates the power of reVISION. Assuming an initial vehicle speed of 65 MPH, the reVISION example reacts within 2.7 ms, applying the brakes in time to avoid a collision and providing the more responsive solution.
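The latency figure translates directly into stopping distance. A back-of-the-envelope check, using the speed and response time from the comparison above and the standard MPH-to-m/s conversion:

```python
MPH_TO_MS = 0.44704  # metres per second in one mile per hour

def reaction_distance(speed_mph, reaction_s):
    """Distance travelled between detecting the obstacle and applying the brakes."""
    return speed_mph * MPH_TO_MS * reaction_s

# At 65 MPH (~29.1 m/s), a 2.7 ms response costs under 8 cm of travel
d = reaction_distance(65, 2.7e-3)
print(f"{d:.3f} m")  # 0.078 m
```

Every additional millisecond of inference latency adds roughly 3 cm of travel at this speed, which is why batch-of-1, low-latency inference matters at the edge.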
reVISION and SDSoC provide the ability to work with high-level frameworks such as Caffe and significantly reduce development time for embedded vision and machine learning applications. The resultant solution is also more responsive, flexible and power efficient, as demanded by edge-based applications. These benefits can be leveraged without the user needing to be an HDL specialist.