AI Accelerators - Hardware For Artificial Intelligence

A few decades ago, CPUs were neither powerful nor efficient enough to run the large computations that machine learning requires. Hardware manufacturers have since worked hard to create processing units specialized for AI operations.


Machine learning is the process by which computer systems learn without explicit instructions, using algorithms and statistical models to analyze data and draw inferences from its patterns. One of the major limitations of Artificial Intelligence and Machine Learning has always been computational power, a long-standing concern for researchers. Early CPUs could not keep up with these workloads, which is what pushed hardware manufacturers toward dedicated AI hardware.

CPUs were the pioneers, but on their own they are no longer a practical source of computational power for large AI workloads. Today such workloads run mostly on GPUs and AI accelerators, which are designed specifically for large-scale computation. The main factors to weigh when purchasing an AI accelerator are cost, energy consumption, and processing speed.

What is an AI Accelerator?

An AI accelerator is a hardware chip designed specifically to run artificial intelligence and machine learning applications quickly and efficiently. Examples of AI accelerators are the Graphics Processing Unit (GPU), Vision Processing Unit (VPU), Field-Programmable Gate Array (FPGA), Application-Specific Integrated Circuit (ASIC), and Tensor Processing Unit (TPU).

1. Vision Processing Unit (VPU) for Machine Learning

Alongside graphics cards, chipmakers also produce separate microprocessors dedicated to machine learning. One class of these, the Vision Processing Unit (VPU), is designed for deep neural networks such as CNNs. It handles the calculations required by image recognition and classification tasks efficiently, whereas a GPU must also serve a variety of other computational workloads. A VPU does not rely on more cores or higher clock speeds than a GPU; its advantage typically comes from specialization and low power consumption, which suits it to cameras, drones, and other edge devices.

Vision Processing Units are well suited to Convolutional Neural Network (CNN) workloads such as image recognition, object detection, and classification. The best-known VPU family is the Myriad line from Movidius, a company Intel acquired in 2016. AMD's Baffin, sometimes mentioned in this context, is a GPU from the Polaris family rather than a VPU, although GPUs of that kind can run most DL tasks (TensorFlow, PyTorch, Caffe2) along with computer vision algorithms such as visual saliency detection, image segmentation, and recognition.

Deep learning inference on a CPU has been reported to be roughly 20x slower than on a Pascal-generation GPU or a VPU of comparable power, although the gap depends heavily on the model, batch size, and numeric precision. Inference time generally falls as more parallel cores are added, but not in perfect proportion: memory bandwidth and the serial portion of the workload eventually limit the gain. Beyond a certain number of cores the returns diminish, and the same budget may be better spent elsewhere, for example on faster memory or a higher-end host CPU.
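One rough way to reason about how inference time changes with core count is Amdahl's law, which caps the speedup by the fraction of the work that cannot be parallelized. The sketch below is illustrative only: the function names and the 95% parallel fraction are assumptions chosen for the example, not measurements of any real CPU, GPU, or VPU.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup predicted by Amdahl's law.

    parallel_fraction is the share of the workload (0.0 to 1.0) that can be
    spread across cores; the rest is assumed to run serially.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def inference_time(single_core_seconds: float, parallel_fraction: float, cores: int) -> float:
    """Estimated inference time on `cores` cores, given the single-core time."""
    return single_core_seconds / amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the work parallelizes, 5% does not.
    for cores in (1, 2, 4, 8, 16, 64, 256):
        print(cores, round(amdahl_speedup(0.95, cores), 2))
```

With a 5% serial portion the speedup can never reach 20x however many cores are added, which is the diminishing-returns effect in numeric form.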

Companies that process large volumes of images, such as Facebook and Pinterest for object recognition, rely on vision accelerators of this kind, while Google uses its own Tensor Processing Unit (TPU) for a wide range of services, including search-related features. Nvidia's CEO Jensen Huang has been quoted as saying that tasks like video analytics, computer vision, and machine learning require "thousands and thousands" of cores for training models.

One example of a VPU is Intel’s Movidius Myriad X, which targets low-power edge devices such as smart cameras, drones, and the Intel Neural Compute Stick 2. VPUs are also being used for robotic navigation and autonomous driving systems. Image recognition, classification, and object detection of this kind also underpin Augmented Reality devices such as HoloLens and Magic Leap. Performance figures quoted for this class of chip, such as 60 FPS at 2304 x 1152 resolution with an accuracy rate of 93%, depend on the specific model being run and should be read as indicative rather than guaranteed.

Applications of Vision Processing Units (VPUs)

Vision Processing Units are best used for image recognition and object detection tasks. CNNs consist of several connected layers that extract progressively more abstract features from the input as it passes through. The first layers detect simple patterns such as edges, later convolutional layers combine them into higher-level features, and so on until an end-to-end classification is achieved.

Since CNNs repeat operations like convolution, max-pooling, and subsampling over every region of an image, they require intensive number-crunching power, which makes them a natural fit for hardware such as VPUs that is built to run these operations in parallel.
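To make the operations mentioned above concrete, here is a minimal pure-Python sketch of an edge-detecting convolution followed by max-pooling. The function names, the tiny 5x5 image, and the 1x2 kernel are illustrative assumptions and do not come from any particular framework. Real CNN layers run the same arithmetic over many channels and kernels at once, which is where the demand for parallel hardware comes from.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """'Valid' 2D cross-correlation (no padding, stride 1), as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 5x5 image: dark (0) on the left, bright (1) on the right.
IMAGE = [[0, 0, 1, 1, 1] for _ in range(5)]
# A 1x2 kernel that responds where brightness increases from left to right.
VERTICAL_EDGE_KERNEL = [[-1, 1]]

if __name__ == "__main__":
    edges = convolve2d(IMAGE, VERTICAL_EDGE_KERNEL)
    print(edges)              # each row marks the column where the edge sits
    print(max_pool2d(edges))  # pooled, lower-resolution summary of the edge map
```

Every output value is independent of the others, so all of them can be computed at the same time on parallel hardware.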

Other use cases of Vision Processing Units (VPUs) include:

  1. Autonomous Driving - VPUs can be used to run deep learning models on high-resolution images in autonomous cars. Nvidia's Drive PX platform, for example, runs such models on onboard processors to power self-driving car systems.
  2. Visual saliency detection - detects the most salient objects present in a scene and draws attention to those objects.
  3. Image segmentation & recognition - images can be labeled pixel by pixel and classified by object type with accurate results. Examples include Baidu's Deep Image and image recognition systems built at Facebook, which perform tasks like semantic segmentation and fine-grained image recognition.
  4. Virtual Reality & Augmented Reality - Virtual reality provides real-time 3D simulations and can use VPUs for object detection and scene analysis. The same goes for augmented reality, where VPUs help place digital objects in the user's physical world.
  5. Security - Face recognition and facial behavior analysis platforms run deep learning models, trained on labeled data, on these processors to analyze camera feeds.

Vision workloads also run on GPUs. Nvidia's Volta architecture added Tensor Cores, units dedicated to the matrix arithmetic used in deep learning. AMD announced that it would begin shipping its Navi graphics processing unit in 2019; the chip was expected to offer up to 512 GB/sec of memory bandwidth while competing against Nvidia's architecture. Like Nvidia's V100, Navi supports general-purpose GPU (GPGPU) algorithms.

2. Field-Programmable Gate Array (FPGA)

Field-Programmable Gate Arrays are programmable integrated circuits that a customer or designer can configure for a specific task after manufacturing. FPGAs gained popularity due to their versatility in hardware acceleration and parallel computing. They can be used for many tasks that traditionally run on digital processors, including image/video processing, signal processing, data encryption/decryption, and other computation-heavy tasks.

The flexibility of configuring different computing units on an FPGA makes it possible to build many kinds of systems, from graphics-style pipelines to designs with numerous vision processing blocks, and to size on-chip memory and data paths to the workload so that memory bottlenecks are reduced. Fixed architectures attack the same problem with wider memory interfaces: Nvidia's Tesla V100, for example, has a 4096-bit memory bus, but its data path cannot be customized per application in the way an FPGA's can.

FPGAs are commonly used for 3D graphics processing, parallel computing, and image recognition or reconstruction, including deep learning networks in autonomous driving applications that use image data from cameras mounted on vehicles or robots.

What are the three basic elements of an FPGA?

An FPGA is built from three basic elements: configurable logic blocks, programmable interconnects, and input/output blocks. The configuration itself is stored using one of three programming technologies: static RAM, anti-fuses, or flash erasable programmable read-only memory (EPROM). The logic blocks are arranged in a large array and connected to one another via the programmable interconnects. Because the device is configured after manufacturing, typically by hardware engineers using a hardware description language, rather than fabricated for a single design as a custom ASIC is, an FPGA avoids much of the up-front manufacturing cost of custom ASIC chips.
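FPGA logic blocks are commonly built around lookup tables (LUTs): small memories whose contents decide which logic function the block computes. The sketch below models that idea in Python. The `LookUpTable` class and `half_adder` function are hypothetical names invented for this illustration; real FPGAs are configured with vendor toolchains and hardware description languages, not with code like this.

```python
from typing import Sequence, Tuple


class LookUpTable:
    """A k-input lookup table; the truth table plays the role of the FPGA configuration."""

    def __init__(self, truth_table: Sequence[int]) -> None:
        n = len(truth_table)
        if n < 2 or n & (n - 1):
            raise ValueError("truth table length must be a power of two (at least 2)")
        self.inputs = n.bit_length() - 1  # e.g. 4 entries -> 2 inputs
        self.truth_table = tuple(1 if bit else 0 for bit in truth_table)

    def evaluate(self, *bits: int) -> int:
        if len(bits) != self.inputs:
            raise ValueError(f"expected {self.inputs} input bits, got {len(bits)}")
        index = 0
        for bit in bits:  # first argument is the most significant bit
            index = (index << 1) | (1 if bit else 0)
        return self.truth_table[index]


# The same "hardware" becomes a different gate purely by changing its configuration.
XOR = LookUpTable([0, 1, 1, 0])
AND = LookUpTable([0, 0, 0, 1])


def half_adder(a: int, b: int) -> Tuple[int, int]:
    """Wire two LUTs together (the 'interconnect') to add two bits: returns (sum, carry)."""
    return XOR.evaluate(a, b), AND.evaluate(a, b)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, half_adder(a, b))
```

Reprogramming an FPGA amounts to loading new truth tables and new wiring between blocks, which is why one chip can serve very different tasks.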

Applications of FPGA chips

In 2016, the automotive industry was a large spender on FPGA hardware and software. The market research firm VDC predicted that global revenue would grow by more than 8% between 2016 and 2021, reaching $4 billion by 2021. Below are some examples of how FPGAs are used:

  1. Driver assistance systems, including sensors for collision avoidance or automatic braking, can be deployed using an FPGA-based algorithm for real-time analysis of data captured by cameras on moving vehicles. The same approach applies to robots in industrial processes such as welding, where objects may collide at high speed and cause injury to workers or equipment failure.
  2. Designs written in general-purpose programming languages can be deployed on FPGA hardware through high-level synthesis tools, using a host computer and an operating system such as Linux, to build image recognition models for cars or robots along with algorithms for 3D graphics, parallel computing, etc.
  3. Networking applications such as network security, encryption/decryption of data packets, and online gaming platforms are well suited to FPGA implementation, because flexible, parallel hardware keeps network latency low. One example cited is the Atlas platform, reportedly developed by Facebook engineers on an FPGA architecture to improve their deep learning systems while freeing GPU-powered machines for tasks other than training models and serving end-users.
  4. FPGAs allow cloud-based machine learning providers such as Google or Facebook to build them into their data center infrastructure and train models faster while retaining the same latency levels for end-users.
  5. Image processing software can be ported to run on FPGAs in industrial cameras, as used by camera companies such as PIxel and Arrow. These companies claim that an FPGA-powered architecture gives them more performance per dollar than competitors who use GPU hardware.


Advantages of FPGA chips

Flexibility: FPGAs offer a high level of flexibility in configuring the hardware modules during development. Unlike GPUs, whose arithmetic logic units (ALUs) are fixed at manufacture, the logic in an FPGA can be configured to handle an array of parallel tasks, making it possible to build systems with multiple vision pipelines or any combination of capabilities.

Hardware acceleration/parallel computing: Because their logic is reconfigurable, FPGAs let designers add processing units as needed and size the memory paths to match, rather than working around a fixed memory bus as on GPUs and CPUs. This makes them a good fit for deep learning networks that require intensive number crunching, since additional ALUs can be added within the device's capacity, although adding them does mean recompiling the design.

Latency and I/O: Although FPGAs typically run at lower clock speeds than GPUs, their custom pipelines can process data as it arrives, which gives low and predictable latency. FPGAs can therefore perform significantly better than GPUs for I/O-intensive applications, such as communication networks that use deep learning algorithms for pattern recognition.

Cloud Computing: FPGAs make it easier to build customized hardware that can be easily configured based on the requirements of clients and users.  This makes them suitable for cloud computing platforms where the topology can change based on requirements at any given time without incurring additional costs or cumbersome development timelines.

Like any other technology, FPGAs have drawbacks, chiefly cost versus flexibility. Because FPGAs are programmable, they require a more specialized development environment and are usually more expensive than GPUs. In addition, changing an FPGA design in production means re-synthesizing the hardware configuration, which is slower and harder than updating software on a GPU.

Current Limitations

FPGAs are limited in raw computational capability and memory bandwidth, and they run at lower clock speeds than GPUs. Deep learning networks with a large number of weights also need more on-chip SRAM (Static Random Access Memory) than most FPGAs provide. As a result, FPGAs are typically deployed in smaller clusters, with a limited number used in high-end applications such as autonomous vehicles and drones. On the other hand, for workloads that are not DNN computations, an FPGA can be a cheaper option than a GPU or CPU.

Lack of speed - when weights must be fetched from external memory, FPGAs can be slower than conventional microprocessors or GPUs, which makes them less suitable for computationally intensive workloads that require real-time results, such as some high-resolution camera processing algorithms used in automotive applications.

The above examples give us an insight into the agility of these architectures, which are being designed to meet the ever-growing demands of deep learning networks in autonomous cars, drones, and robots; they also highlight the use of GPGPUs for accelerating network computations. GPUs have a huge advantage over CPUs when performing parallel computing tasks, while FPGAs hold an advantage over GPUs in flexibility, because they can be re-configured after manufacturing.

However, as with any technology, future generations of each architecture will bring faster clock speeds and larger memory buses, so the balance between these competitors will keep shifting.

What companies are providing FPGAs?

The market leaders are Xilinx and Altera (the latter acquired by Intel in 2015), the two biggest suppliers of FPGAs and of the programming tools engineers use to configure the devices to their requirements. IBM has said it will use its SoftLayer cloud to provide GPUs and FPGAs. Microsoft Azure and Amazon Web Services also let users choose between GPUs and FPGAs on their cloud computing platforms.

Mobileye, a major supplier of lane departure warning systems for vehicles, is reported to pair Intel's Xeon processors with an Altera field-programmable gate array (FPGA) co-processor to power its camera systems. Intel acquired Mobileye in 2017 for $15.3 billion. It remains unclear how significant a role these accelerators play in autonomous driving projects, but the acquisition hints at a future where automation plays a significant role in both developed and emerging automobile markets.

The co-processor is programmed to handle the data processing required by advanced driver assistance systems (ADAS): it can process high-dimensional input from Mobileye's EyeQ3 vision chip while reducing processor load, which directly reduces power consumption. The combined solution is said to reduce latency to around 240 ms, compared with an average of about 1 second for most cameras.

3. Application-Specific Integrated Circuit (ASIC)

ASICs are designed for a single application or purpose and, unlike FPGAs or GPUs, cannot be re-programmed after manufacturing. Because the circuit is built for exactly one task, an ASIC can be more efficient than an FPGA or GPU at that task. This makes ASICs well suited to use cases such as trading, gaming, and cryptocurrency mining.

ASICs have gained popularity in recent years, with major technology companies like Intel and IBM using ASIC-based systems to power their cloud computing platforms.

Whereas GPUs handle both graphics and general computation and FPGAs can be reconfigured for many tasks, an ASIC is built for one computation that demands high performance, such as the hashing done by cryptocurrency miners. ASICs have better electrical characteristics than FPGAs and can therefore provide higher computation speeds. They are also cheaper per chip at high volumes, which makes them the preferred choice when the large up-front design cost can be justified. Mining ASICs usually need very little external memory, because the hashing workload itself is compact.

Intel's Nervana Neural Network Processor (NNP) is an ASIC designed for deep learning, built around an architecture intended to give neural network layers very high throughput. Intel positioned it as delivering more performance than GPUs on these workloads, with the stated hope of making AI accelerators more affordable.

Intel's acquisition of Nervana Systems in August 2016 provided the company with considerable expertise in training and inference algorithms. In the same year Intel also acquired Movidius, whose vision processing units (VPUs) it has since released for both industrial and consumer use cases, including the Movidius Neural Compute Stick. Intel's AI portfolio now spans RealSense depth cameras, Xeon and Core processors, and these dedicated accelerators.

Advantages of ASICs

ASICs can be superior to FPGAs in terms of performance since they have lower latency and better electrical characteristics, which is why Bitcoin ASICs are so powerful. They also offer high power efficiency and, because their logic is fixed in silicon, strong security properties. Their flexibility exists only at design time: an ASIC can be shaped to perform any task in its design specification, but it cannot be changed afterwards.

Disadvantages of ASICs

The main drawback of ASICs is that designing and manufacturing them requires enormous capital investment. This leads many companies to rely on GPUs or FPGAs, which require less initial funding and can still provide adequate performance. In cryptocurrency mining, for example, GPUs and FPGAs remain viable only until a more efficient ASIC appears for a given currency, and an ASIC built for one currency, such as Bitcoin, depends on that currency for its financial return.

ASICs are not limited to mining; they are also used wherever a fixed workload justifies custom silicon, and AI accelerators are one example. For mining specifically, it makes little sense to invest in developing an ASIC unless a cryptocurrency's algorithm rewards such specialized hardware.

This makes it hard for cryptocurrency miners using GPUs and FPGAs to predict future profitability, since it is difficult to know how quickly specialized hardware will appear for a given cryptocurrency.

4. Tensor Processing Unit (TPU)

A tensor processing unit (TPU) is an ASIC made by Google to accelerate machine learning applications. It was originally built to run models written with TensorFlow, whose computations operate on data structures called tensors. Tensors are generalizations of vectors and matrices to potentially higher dimensions.
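As a small illustration of what "generalizations of vectors and matrices" means, the sketch below represents tensors as nested Python lists and reports their shape and rank (number of dimensions). The helper names are assumptions made up for this example; frameworks such as TensorFlow provide their own tensor types, which this code does not use.

```python
def tensor_shape(tensor) -> tuple:
    """Shape of a regularly nested list; a bare number has the empty shape ()."""
    shape = []
    current = tensor
    while isinstance(current, list):
        shape.append(len(current))
        if not current:
            break
        current = current[0]
    return tuple(shape)


def tensor_rank(tensor) -> int:
    """Rank is the number of dimensions: 0 scalar, 1 vector, 2 matrix, and so on."""
    return len(tensor_shape(tensor))


scalar = 3.0                                   # rank 0
vector = [1.0, 2.0, 3.0]                       # rank 1
matrix = [[1.0, 2.0], [3.0, 4.0]]              # rank 2
# Hypothetical batch of two 4x4 grayscale images: rank 3.
image_batch = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]

if __name__ == "__main__":
    for name, value in [("scalar", scalar), ("vector", vector),
                        ("matrix", matrix), ("image_batch", image_batch)]:
        print(name, tensor_shape(value), tensor_rank(value))
```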

Features of Google's TPU technology

Google claims that a second-generation TPU variant rated at 80 teraflops is up to 9 times more energy efficient than a general-purpose CPU. The architecture also includes specific support for deep learning inference, with data throughput that is 2x to 3x higher than that of the first-generation TPUs used in Google's data centers.

The TPU relies on mixed precision: multiplications are performed on 16-bit floating-point inputs, while the results are accumulated at 32-bit floating-point precision. The 16-bit format Google uses, bfloat16, keeps the 8-bit exponent of a 32-bit float and shortens the mantissa to 7 bits. It therefore covers the same numeric range as a 32-bit float with less precision, unlike the standard IEEE half-precision format, which has a 5-bit exponent and a 10-bit mantissa.
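The effect of 16-bit arithmetic on precision can be seen with a short sketch of bfloat16, the 16-bit floating-point format used by Google's TPUs. Two simplifications are assumptions of this example: the conversion truncates the low bits instead of rounding to nearest, and the accumulator is a Python float (64-bit) standing in for the wider accumulator in hardware. The function names are hypothetical.

```python
import struct
from typing import Sequence


def to_bfloat16(value: float) -> float:
    """Round-trip a number through bfloat16 by keeping only the top 16 bits of its float32 form.

    This keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits.
    Truncation is used for simplicity; hardware typically rounds to nearest.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    (result,) = struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))
    return result


def mixed_precision_dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product with low-precision inputs and a higher-precision accumulator."""
    total = 0.0  # accumulator stays in full Python float precision
    for x, y in zip(a, b):
        total += to_bfloat16(x) * to_bfloat16(y)
    return total


if __name__ == "__main__":
    print(to_bfloat16(3.14159))   # only about 2-3 significant decimal digits survive
    print(mixed_precision_dot([1.0, 2.0], [0.5, 0.25]))
```

Values that need few mantissa bits, such as 1.0 or 0.5, pass through unchanged, while a value like 3.14159 loses its lower digits. Neural networks usually tolerate this loss well, which is why the format is attractive for accelerators.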

A TPU derives its computing power from hardware specialized for the matrix arithmetic at the heart of neural networks, which are used for accurate language and image recognition and for parsing structured data in real time. The first-generation TPU was designed for inference, the step in which a pre-trained ML model is run on new data; inference is usually less computationally intensive than training, which later TPU generations also support.

What is the capacity of a Tensor Processing Unit (TPU)?

Google states that its second-generation TPU can perform inference at 4,500 images per second on ResNet-50, a workload for which it would take 16 high-end Nvidia K80 GPUs to match one TPU. Google further claims that a 32 teraflops variant of the new TPU architecture provides 6x higher performance than the first-generation TPUs.

What companies are using the Tensor Processing Unit?

TPUs are deployed in Google's (Alphabet's) own data centers and offered to customers through Google Cloud. Other large companies, including e-commerce giant Alibaba and search engine giant Baidu, have developed their own AI accelerator chips for their data centers rather than using Google's TPU.

Intel also announced a comparable deep learning accelerator, code-named Lake Crest, to power Deep Learning workloads for the manufacturing, healthcare, finance, and services industries.

Can I buy a Tensor Processing Unit?

No, not the data-center version. Google does not sell the TPUs used in its data centers. You can, however, rent them through Google Cloud, at a listed cost of $1.35 per machine per hour.

As promised, the next section covers a different kind of processor, one that mimics the way the human brain works: the “neuromorphic processor”.

5. Neuromorphic Processor

What is a Neuromorphic Processor?

Neuromorphic processors are designed to resemble the human brain as closely as possible in their architecture and operation. This can be achieved with analog circuits that perform computations similar to those of neurons in the human brain, which allows them to carry out a complex set of operations with large amounts of memory, albeit at low operating speeds.
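A common textbook abstraction of the neuron-like computation described above is the leaky integrate-and-fire (LIF) model: a neuron accumulates input, slowly leaks charge, and emits a spike when a threshold is crossed. The sketch below simulates one such neuron in discrete time steps. It is a simplified illustration under assumed parameter values, not a description of how any specific neuromorphic chip is implemented.

```python
from typing import Iterable, List


def simulate_lif(input_current: Iterable[float],
                 threshold: float = 1.0,
                 leak: float = 0.9,
                 reset: float = 0.0) -> List[int]:
    """Simulate a leaky integrate-and-fire neuron; returns one 0/1 spike per time step.

    Each step the membrane potential decays by the `leak` factor, the input is
    added, and a spike is emitted (and the potential reset) if the threshold is reached.
    """
    potential = reset
    spikes = []
    for current in input_current:
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = reset
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    # Hypothetical inputs: a weak signal never fires, a stronger one fires periodically.
    print(simulate_lif([0.05] * 20))
    print(simulate_lif([0.4] * 20))
```

The neuron does work only when enough input arrives to make it spike, which is the intuition behind the low energy use claimed for neuromorphic hardware.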

In the past few years, these circuits have been designed to run deep learning algorithms that were originally developed for ASICs and other traditional processing hardware. This allows neuromorphic processors to run AI applications more efficiently and with lower power consumption, which can make them a more cost-effective option than GPUs and FPGAs.

Who invented the Neuromorphic Processor?

The concept of neuromorphic processors was pioneered by Carver Mead, a Caltech professor who has been working on the development of circuits that would mimic the human brain since 1979.

Advantages & Disadvantages of Neuromorphic Processor

The key advantage of neuromorphic processors is their ability to deliver high levels of performance to AI applications at only a fraction of the energy cost required by traditional processors. They can also be incorporated into many kinds of computational devices, including mobile phones and other handheld devices as well as in-field computers, delivering high performance with minimal effort.

However, neuromorphic processors remain inferior to GPUs and FPGAs in efficiency and performance when executing standard financial and mathematical operations. In addition, they usually have to be integrated alongside conventional processors to operate effectively, which increases start-up costs.

What is the processing power of neuromorphic processors?

In one reported comparison, a neuromorphic processor outperformed Google's TPU, processing 100 times as many frames per second while using 10,000 times less energy. Both processors were tested on the Atari game Q*bert, and the neuromorphic processor won by a score of 1 million to 14 thousand.

The IBM TrueNorth

IBM's TrueNorth is one of the best-known neuromorphic processors. The chip contains about one million programmable neurons and 256 million synapses while drawing on the order of 70 milliwatts, far less power than a desktop processor such as the Intel Core i7-7700K. Because it computes with spikes and not floating-point arithmetic, its performance is not directly comparable in TFLOPs.

Final Words

With the latest AI accelerators cutting cost, energy consumption, and data processing time, the worry of limited processing power is fading. Neuromorphic computers go a step further by modeling their processing on the way the brain functions. To gain optimum processing power, intelligently designed circuits, efficient software code, and simpler algorithms remain essential.