What Is the Difference Between CUDA Cores and Tensor Cores? (Explained)

CUDA cores and Tensor Cores are both hardware units found in GPUs made by Nvidia. So what are they? CUDA originally stood for Compute Unified Device Architecture. According to Nvidia, CUDA cores are present in your GPUs, smartphones, and even your cars.

CUDA itself is a parallel computing platform and application programming interface (API) that enables software to use specific types of graphics processing units (GPUs) for general-purpose processing. CUDA cores are the general-purpose arithmetic units inside the GPU that carry out that work.

Tensor Cores, also developed by Nvidia, are a second kind of core found in its GPUs. Tensor Cores enable mixed-precision computing, adapting calculations dynamically to increase throughput while maintaining accuracy.

In simple terms, both kinds of cores are parts of the GPU in your PC that carry out specific calculations. A CUDA core multiplies two numbers and adds the result to a third number.

A Tensor Core does the same thing, but with 4×4 matrices instead of single numbers. These calculations are what allow the GPU to render graphics faster for you.
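To make the comparison concrete, here is a minimal plain-Python sketch of the two operations described above. The function names `fma` and `matrix_mma` are hypothetical and chosen for illustration. The code only models the arithmetic, not how the hardware schedules or executes it.

```python
def fma(a, b, c):
    """Scalar multiply-add (a * b + c), the operation attributed to a CUDA core.

    Note: Python rounds after the multiply and again after the add, so this
    is not a truly fused operation; it only models the arithmetic.
    """
    return a * b + c


def matrix_mma(A, B, C):
    """4x4 matrix multiply-accumulate D = A * B + C, the tile-level
    operation attributed to a first-generation Tensor Core."""
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]
            for k in range(n):
                acc = fma(A[i][k], B[k][j], acc)  # 4 * 4 * 4 = 64 multiply-adds in total
            D[i][j] = acc
    return D


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[float(4 * i + j) for j in range(4)] for i in range(4)]
C = [[1.0] * 4 for _ in range(4)]

D = matrix_mma(A, identity, C)  # A * I + C, so every entry of A grows by 1
print(fma(2.0, 3.0, 4.0))       # 10.0
print(D[0])                     # [1.0, 2.0, 3.0, 4.0]
```

One 4×4 matrix multiply-accumulate bundles 64 scalar multiply-adds into a single operation, which is where the throughput advantage comes from.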


What Is CUDA?

CUDA, short for Compute Unified Device Architecture, is a parallel computing platform and application programming interface (API) developed by Nvidia and released on June 23, 2007.

It enables software to use specific types of graphics processing units (GPUs) for general-purpose processing, an approach known as general-purpose computing on GPUs (GPGPU).

CUDA is a software layer that provides direct access to the GPU’s virtual instruction set and parallel computational elements for the execution of compute kernels. CUDA was developed to work with different programming languages including C, C++, and Fortran.
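As a rough illustration of what a compute kernel is, the sketch below emulates a kernel launch in plain Python. The names `launch` and `vector_add_kernel` are hypothetical, and the loops run serially, whereas a real GPU would run the threads in parallel. The index formula (block index times block size plus thread index) mirrors the usual CUDA idiom for giving each thread one element to process.

```python
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body executed by one emulated thread: add a single pair of elements."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):                        # guard: the last block may overshoot
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Serial stand-in for a kernel launch over grid_dim blocks of block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
a = list(range(n))
b = [10 * x for x in range(n)]
out = [0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements

launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)  # [0, 11, 22, 33, 44, 55, 66, 77, 88, 99]
```

The kernel is written from the point of view of a single thread, and the launch decides how many threads run. That is what lets the same code scale across thousands of cores.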

Support for these languages makes it easier for specialists in parallel programming to use GPU resources. Earlier APIs such as Direct3D and OpenGL, by contrast, required advanced skills in graphics programming.

CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC, and OpenCL, as well as HIP, which works by compiling such code to CUDA. The name CUDA was originally an acronym for Compute Unified Device Architecture, but Nvidia later dropped the common use of the acronym.

A powerful Nvidia Graphics card GTX 1080 Ti

More About CUDA

As a specialized computer processor, the graphics processing unit (GPU) meets the needs of real-time, compute-intensive 3D graphics workloads.

By around 2012, GPUs had evolved into highly parallel multi-core systems that process large blocks of data efficiently.

When processing huge blocks of data in parallel, this design is superior to general-purpose central processing units (CPUs) for algorithms, such as:

cryptographic hash functions
machine learning
molecular dynamics simulations
physics engines
sort algorithms

Uses of the CUDA Architecture Now and in the Future

Accelerated rendering of 3D graphics
Accelerated interconversion of video file formats
Accelerated encryption, decryption, and compression
Bioinformatics, e.g., NGS DNA sequencing BarraCUDA
Distributed calculations, such as predicting the native conformation of proteins
Medical analysis simulations, for example, virtual reality based on CT and MRI scan images
Physical simulations, in particular in fluid dynamics
Neural network training in machine learning problems
Face recognition
Distributed computing projects, such as SETI@home and other projects using BOINC
Molecular dynamics
Mining cryptocurrencies
Structure from motion (SfM) software

What Is a Tensor Core?

Tensor Cores are specialized cores that enable mixed-precision training. The first generation does this with a fused multiply-add operation: it multiplies two 4 x 4 FP16 matrices and adds the product to a 4 x 4 FP16 or FP32 matrix.

The final result can be FP32 with only a slight loss of precision, even though the input matrices are low-precision FP16. This combination of precisions is why the approach is called mixed-precision computing.
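The benefit of accumulating in FP32 can be seen in a small experiment. The sketch below uses Python's `struct` module, whose `"e"` and `"f"` format codes round a value to IEEE 754 half and single precision. The helper names are hypothetical, and this is an emulation of the number formats under those assumptions, not of Tensor Core hardware.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_accumulate(a, b):
    """FP16 inputs, running sum also rounded to FP16 after every step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x) * to_fp16(y))
    return acc


def dot_fp32_accumulate(a, b):
    """FP16 inputs, running sum kept in FP32 (the mixed-precision scheme)."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two FP16 values fits exactly in FP32.
        acc = to_fp32(acc + to_fp16(x) * to_fp16(y))
    return acc


ones = [1.0] * 4096
fp16_result = dot_fp16_accumulate(ones, ones)
fp32_result = dot_fp32_accumulate(ones, ones)
print(fp16_result)  # 2048.0: FP16 cannot represent 2049, so the sum stalls
print(fp32_result)  # 4096.0
```

With an FP16 accumulator the running sum stops growing once the next integer is no longer representable, while the FP32 accumulator returns the exact answer from the same FP16 inputs.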

In practice, this significantly speeds up the calculations with little influence on the model’s final effectiveness. Later microarchitectures have extended this capability to even lower-precision number formats.

The first generation was introduced with the Volta microarchitecture, starting with the V100. Each new GPU microarchitecture since then has made more number formats available for computation.

We’ll talk about how Tensor Cores’ capacity and functionality have changed and improved with each microarchitecture generation in the section that follows.

A Graphically rendered image made by a Titan V

How Do Tensor Cores Work?

First Generation:

The first generation of Tensor Cores was introduced with the Volta GPU microarchitecture. These cores made it possible to train with mixed precision and the FP16 number format.

This could deliver up to a 12x boost in teraFLOP throughput on certain GPUs. The 640 Tensor Cores of the top-tier V100 give up to a 5x performance increase over the previous-generation Pascal GPUs.

Second Generation:

The second generation of Tensor Cores arrived with Turing GPUs. Int8, Int4, and Int1 were added to the list of supported Tensor Core precisions, which was previously limited to FP16.

With mixed-precision training, GPU throughput increased by up to 32 times compared to Pascal GPUs.
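To give a sense of what computing in Int8 involves, here is a minimal sketch of a quantized dot product. The symmetric per-tensor scaling scheme and the helper names are assumptions made for illustration. They are one common way to map floats to 8-bit integers and are not taken from NVIDIA's implementation.

```python
def int8_scale(values):
    """Hypothetical per-tensor scale: the largest magnitude maps to 127."""
    return max(abs(v) for v in values) / 127


def quantize_int8(values, scale):
    """Map floats to integer codes, clamped to the int8 range [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(qa, qb):
    """Integer dot product. Python ints never overflow, which stands in
    for the wider integer accumulator used with int8 inputs."""
    acc = 0
    for x, y in zip(qa, qb):
        acc += x * y
    return acc


a = [0.5, -1.0, 0.25, 2.0]
b = [1.0, 0.5, -2.0, 0.75]

scale_a, scale_b = int8_scale(a), int8_scale(b)
qa, qb = quantize_int8(a, scale_a), quantize_int8(b, scale_b)

approx = int8_dot(qa, qb) * scale_a * scale_b  # rescale back to real units
exact = sum(x * y for x, y in zip(a, b))       # 1.0
print(exact, approx)
```

The integer result lands close to the exact value while every multiplication operates on 8-bit operands, which is the trade-off these lower-precision formats are designed around.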

Third Generation:

The architecture in an Ampere GPU expands on the Volta and Turing microarchitectures’ earlier advancements by adding support for FP64, TF32, and bfloat16 precisions.

These extra precision formats further accelerate deep learning training and inference. For instance, the TF32 format works much like FP32 while, according to NVIDIA, delivering speedups of up to 20x without any code changes.

On top of that, enabling automatic mixed precision with just a few lines of code can speed up training by an additional 2x.
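To make the TF32 and bfloat16 formats more concrete: both keep FP32's 8 exponent bits, so they cover the same range, but they store fewer mantissa bits (10 for TF32, 7 for bfloat16, versus 23 for FP32). The sketch below emulates them by zeroing the low mantissa bits of an FP32 value. Truncation is a simplifying assumption here, since real hardware may round instead, and the helper names are hypothetical.

```python
import struct


def fp32_bits(x):
    """Bit pattern of x after rounding to FP32, as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits):
    """Inverse of fp32_bits: reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def truncate_mantissa(x, kept_bits):
    """Keep the top kept_bits of FP32's 23 mantissa bits and zero the rest."""
    drop = 23 - kept_bits
    mask = (0xFFFFFFFF >> drop) << drop
    return bits_to_fp32(fp32_bits(x) & mask)


def to_tf32(x):
    return truncate_mantissa(x, 10)  # 8 exponent bits, 10 mantissa bits


def to_bfloat16(x):
    return truncate_mantissa(x, 7)   # 8 exponent bits, 7 mantissa bits


print(to_tf32(1.0 + 2**-10))     # 1.0009765625: smallest step TF32 keeps above 1
print(to_tf32(1.0 + 2**-11))     # 1.0: too fine for TF32
print(to_bfloat16(1.0 + 2**-8))  # 1.0: too fine for bfloat16
print(to_tf32(1.0e20))           # still close to 1e20; FP16 tops out at 65504
```

Keeping FP32's exponent range is why TF32 can stand in for FP32 without code changes, while the shorter mantissa is what makes the arithmetic cheaper.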

Other features of the Ampere microarchitecture include third-generation NVLink for very fast multi-GPU communication, second-generation Ray Tracing cores, and specialized support for sparse matrix math.

Fourth Generation:

The fourth generation of Tensor Cores comes with the Hopper microarchitecture. The H100, announced in March 2022, handles FP8 precision formats and, according to NVIDIA, accelerates huge language models “by an astonishing 30X over the previous generation.”

An RTX graphics card is used for rendering graphics very fast as it contains tensor cores.

The Difference Between CUDA Cores and Tensor Cores

When Tensor Cores first launched, they were limited to the Titan V and Tesla V100. Each of the 5120 CUDA cores on these two GPUs can perform at most one single-precision multiply-accumulate operation (for example, in fp32: x += y * z) per GPU clock (e.g. the Tesla V100 PCIe runs at 1.38 GHz).

Each Tensor Core operates on small 4×4 matrices. In one GPU clock, each Tensor Core can complete one matrix multiply-accumulate operation.

It multiplies two 4×4 FP16 matrices and adds the resulting 4×4 FP32 product to the accumulator (which is also an fp32 4×4 matrix).

Because the input matrices are fp16 while the multiplication results and accumulator are fp32, the algorithm is known as mixed precision.
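The figures quoted in this article are enough for a back-of-the-envelope comparison of peak throughput on the V100. The sketch below assumes the usual convention that one multiply-accumulate counts as two floating-point operations, and that one 4×4 matrix multiply-accumulate consists of 4 × 4 × 4 = 64 scalar multiply-accumulates. These are theoretical peaks, not measured performance, and the function name is hypothetical.

```python
def peak_tflops(units, macs_per_unit_per_clock, clock_ghz):
    """Theoretical peak in teraFLOPS, counting one multiply-accumulate as 2 FLOPs."""
    return units * macs_per_unit_per_clock * 2 * clock_ghz / 1000.0


CLOCK_GHZ = 1.38  # Tesla V100 PCIe clock quoted above

cuda_tflops = peak_tflops(5120, 1, CLOCK_GHZ)           # one scalar MAC per core per clock
tensor_tflops = peak_tflops(640, 4 * 4 * 4, CLOCK_GHZ)  # one 4x4 matrix MAC per core per clock

print(round(cuda_tflops, 2))        # 14.13
print(round(tensor_tflops, 2))      # 113.05
print(tensor_tflops / cuda_tflops)  # 8.0
```

Under these assumptions, the 640 Tensor Cores offer about eight times the peak matrix-math throughput of the 5120 CUDA cores on the same chip.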

The correct term would likely be just “4×4 matrix cores,” but the NVIDIA marketing team chose to use “tensor cores.”

Tensor cores full explanation in a nutshell

GPU card	CUDA cores	VRAM
GeForce GTX 1660 Ti	1536	6GB
GeForce GTX 1660 Super	1408	6GB
GeForce GTX 1660	1408	6GB
GeForce GTX 1650 Super	1280	4GB
GeForce GTX 1650	1024 and 896	4GB
GeForce GTX 1060 3GB	1152	3GB
GeForce GTX 1060 6GB	1280	6GB
GeForce GTX 1050 Ti	768	4GB
GeForce GTX 1050	640	2GB
GeForce GTX 960	1024	2GB
GeForce GTX 950	768	2GB
GeForce GTX 780 Ti	2880	3GB
GeForce GTX 780	2304	3GB
GeForce GTX 750 Ti	640	2GB
GeForce GTX 750	512	1GB or 2GB

GPUs that contain CUDA cores

Conclusion

CUDA cores and Tensor Cores are both hardware units found in Nvidia GPUs. CUDA originally stood for Compute Unified Device Architecture, and according to Nvidia, CUDA cores are present in your GPUs, smartphones, and even your cars.
Tensor Cores, also developed by Nvidia, are specialized cores that allow for mixed-precision training. The first generation of Tensor Cores made it possible to train with mixed precision and the FP16 number format.
This could deliver up to a 12x boost in teraFLOP throughput on certain GPUs. The second generation added Int8, Int4, and Int1 to the list of supported Tensor Core precisions.
With mixed-precision training, GPU throughput increased by up to 32 times compared to Pascal GPUs. The fourth generation of Tensor Cores, based on the Hopper microarchitecture, adds support for FP8.