CUDA cores and Tensor Cores are both technologies developed by Nvidia. So what are they? CUDA stands for Compute Unified Device Architecture. According to Nvidia, CUDA cores are present in GPUs, smartphones, and even cars.

CUDA itself is a parallel computing platform and application programming interface (API) that enables software to use specific types of graphics processing units (GPUs) for general-purpose processing. CUDA cores are the execution units inside the GPU that carry out this work.

Tensor Cores, also developed by Nvidia, are likewise found in GPUs. Tensor Cores enable mixed-precision computing, adapting calculations dynamically to increase throughput while maintaining accuracy.

In simple words, these cores are the parts of the GPU in your PC that carry out certain calculations. A CUDA core multiplies two numbers and adds the result to a third number.

A Tensor Core does the same, but with 4×4 matrices. Doing these calculations in hardware is what lets the GPU render graphics faster for you.

## What Is CUDA?

**Compute Unified Device Architecture, or CUDA for short, was developed by Nvidia and released on June 23, 2007. It is a parallel computing platform and application programming interface (API).**

It enables software to use specific types of graphics processing units (GPUs) for general-purpose processing, a method known as general-purpose computing on GPUs (GPGPU).

CUDA is a software layer that provides direct access to the GPU’s virtual instruction set and parallel computational elements for the execution of compute kernels. CUDA was developed to work with different programming languages including C, C++, and Fortran.
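To make the idea of a compute kernel more concrete, here is a plain-Python emulation of the execution model. It is not real CUDA code: the names `saxpy_kernel` and `launch` are invented for this sketch, and the loops run one after another, whereas a GPU would run the threads in parallel. The point is only that a kernel is a function every thread executes once, using its block and thread indices to pick the element it handles.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body executed once per emulated thread: out[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(x):                          # guard: the last block may be partly unused
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Hypothetical stand-in for a kernel launch; runs the threads sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4                                # threads per block
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover all n elements
launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
print(out)
```

Rounding the block count up and guarding with `i < len(x)` mirrors the usual pattern for arrays whose length is not a multiple of the block size.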

Support for these languages makes it easier for specialists in parallel programming to use GPU resources. Earlier APIs such as Direct3D and OpenGL, by contrast, required advanced skills in graphics programming.

CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC, and OpenCL, as well as HIP, by compiling such code to CUDA. When CUDA was first introduced, the name was an acronym for Compute Unified Device Architecture. Nvidia later dropped the common use of the expanded form.

### More About CUDA

As a specialized computer processor, the graphics processing unit (GPU) meets the needs of real-time, compute-intensive 3D graphics workloads.

By around 2012, GPUs had evolved into highly parallel multi-core systems that process large blocks of data efficiently.

When large blocks of data are processed in parallel, this design is more efficient than a general-purpose central processing unit (CPU) for algorithms such as:

- cryptographic hash functions
- machine learning
- molecular dynamics simulations
- physics engines
- sort algorithms

### Uses of the CUDA Architecture Now and in the Future

- Accelerated rendering of 3D graphics
- Accelerated interconversion of video file formats
- Accelerated encryption, decryption, and compression
- Bioinformatics, e.g., NGS DNA sequencing with BarraCUDA
- Distributed calculations, such as predicting the native conformation of proteins
- Medical analysis simulations, for example, virtual reality based on CT and MRI scan images
- Physical simulations, in particular in fluid dynamics
- Neural network training in machine learning problems
- Face recognition
- Distributed computing projects, such as those using BOINC
- Molecular dynamics
- Mining cryptocurrencies
- Structure from motion (SfM) software

## What Is a Tensor Core?

**Tensor Cores are specialized cores that enable mixed-precision training. The first generation does this with a fused multiply-add operation: it multiplies two 4 x 4 FP16 matrices and adds the product to a 4 x 4 FP16 or FP32 matrix.**

The computation is called mixed precision because the input matrices may be low-precision FP16 while the final result is FP32, with only a slight loss of precision.
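A small experiment shows why the accumulator precision matters. The sketch below uses only Python's `struct` module to round values to FP16 and FP32, so it emulates the number formats in software and does not touch a GPU. The helper names `to_fp16` and `to_fp32` and the test values are assumptions made for illustration.

```python
import struct


def to_fp16(v):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", v))[0]


def to_fp32(v):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


# 4096 pairs of FP16 inputs; every product is small (about 0.0001)
a = [to_fp16(0.01)] * 4096
b = [to_fp16(0.01)] * 4096

acc16 = 0.0
acc32 = 0.0
for p, q in zip(a, b):
    acc16 = to_fp16(acc16 + to_fp16(p * q))  # products and running sum kept in FP16
    acc32 = to_fp32(acc32 + p * q)           # same FP16 inputs, FP32 running sum

exact = sum(p * q for p, q in zip(a, b))     # double-precision reference

print(f"FP16 accumulate: {acc16:.6f}")
print(f"FP32 accumulate: {acc32:.6f}")
print(f"reference:       {exact:.6f}")
```

Once the FP16 running sum grows large enough, each small product falls below half the spacing between neighbouring FP16 values and is rounded away, so the sum stops growing. The FP32 running sum stays close to the reference even though the inputs are the same FP16 numbers.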

In practice, this significantly speeds up calculations with little effect on the model’s final effectiveness. Later microarchitectures have extended this capability to even lower-precision number formats.

The first generation was introduced with the Volta microarchitecture, starting with the V100. Each later GPU microarchitecture has made more number formats available for computation.

We’ll talk about how Tensor Cores’ capacity and functionality have changed and improved with each microarchitecture generation in the section that follows.

## How Do Tensor Cores Work?

### First Generation:

The first generation of Tensor Cores was introduced with the Volta GPU microarchitecture. These cores made it possible to train with mixed precision and the FP16 number format.

This can give up to a 12x boost in teraFLOP throughput on certain GPUs. The 640 Tensor Cores of the top-tier V100 deliver up to a 5x speedup over the previous-generation Pascal GPUs.

### Second Generation:

The second generation of Tensor Cores arrived with Turing GPUs. INT8, INT4, and INT1 were added to the list of supported Tensor Core precisions, which had previously been limited to FP16.

With mixed-precision training, GPU throughput increased by up to 32 times compared with Pascal GPUs.

### Third Generation:

The Ampere GPU architecture builds on the earlier advances of the Volta and Turing microarchitectures by adding support for FP64, TF32, and bfloat16 precisions.

These extra precision formats accelerate deep learning training and inference even further. For instance, the TF32 format works much like FP32 while offering speedups of up to 20x without any code changes.
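To give a feel for why TF32 can stand in for FP32, the sketch below emulates it in software. TF32 keeps FP32's sign bit and 8 exponent bits but only 10 of the 23 mantissa bits. The function name `truncate_to_tf32` is invented here, and plain truncation is an assumption chosen for simplicity, so the rounding may differ from what hardware does.

```python
import struct


def fp32_bits(v):
    """Return the 32-bit IEEE 754 single-precision pattern of v as an int."""
    return struct.unpack("<I", struct.pack("<f", v))[0]


def truncate_to_tf32(v):
    """Keep the sign, 8 exponent bits and top 10 mantissa bits of an FP32 value.

    Assumption: the dropped bits are simply cleared (truncation).
    """
    bits = fp32_bits(v) & 0xFFFFE000  # clear the low 13 of the 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for v in (1.0, 3.14159265, 1.0e30, 1.0e-30):
    t = truncate_to_tf32(v)
    rel_err = abs(t - v) / abs(v)
    print(f"{v:.8g} -> {t:.8g} (relative error {rel_err:.1e})")
```

Because the exponent field is unchanged, very large and very small values such as 1e30 keep their magnitude, which FP16 (maximum 65504) cannot represent. Only the fine detail of the mantissa is lost, at a relative error below 2^-10.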

On top of that, automatic mixed precision can speed up training by a further 2x with just a few lines of code.

Other features of the Ampere microarchitecture include third-generation NVLink for very fast multi-GPU communication, third-generation Ray Tracing cores, and specialized support for sparse matrix math.

### Fourth Generation:

The fourth generation of Tensor Cores is based on the Hopper microarchitecture. The fourth-generation Tensor Cores in the H100, which NVIDIA announced in March 2022, handle FP8 precision formats and, according to NVIDIA, accelerate huge language models “by an astonishing 30X over the previous generation.”

## The Difference Between CUDA Cores and Tensor Cores

**Tensor Cores first appeared in the Titan V and Tesla V100. Each of the 5120 CUDA cores on both GPUs can perform at most one single-precision multiply-accumulate operation (for example, in fp32: x += y * z) per GPU clock (e.g. the Tesla V100 PCIe frequency is 1.38 GHz).**

Each Tensor Core operates on small 4×4 matrices. In one GPU clock, each Tensor Core can complete one matrix multiply-accumulate operation.
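These per-clock figures translate into theoretical peak throughput with simple arithmetic. The sketch below uses the numbers quoted in this article for the Tesla V100 PCIe (5120 CUDA cores, 640 Tensor Cores, 1.38 GHz). Counting one multiply-add as two floating-point operations is the usual convention and is an assumption of this sketch, as are the variable names.

```python
# Figures quoted in the article for the Tesla V100 PCIe
cuda_cores = 5120
tensor_cores = 640
clock_hz = 1.38e9            # 1.38 GHz

flops_per_fma = 2            # one multiply plus one add

# One FP32 multiply-accumulate per CUDA core per clock
cuda_peak = cuda_cores * flops_per_fma * clock_hz

# One 4x4 by 4x4 matrix multiply-accumulate per Tensor Core per clock:
# 16 output elements, each needing 4 multiply-adds
fma_per_mma = 4 * 4 * 4
tensor_peak = tensor_cores * fma_per_mma * flops_per_fma * clock_hz

print(f"CUDA cores:   {cuda_peak / 1e12:.1f} TFLOPS (FP32)")
print(f"Tensor Cores: {tensor_peak / 1e12:.1f} TFLOPS (mixed precision)")
print(f"ratio:        {tensor_peak / cuda_peak:.0f}x")
```

These are upper bounds that assume every core is busy on every clock, so real workloads will land below them.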

A Tensor Core matrix multiply-accumulate multiplies two 4×4 FP16 matrices and adds the resulting 4×4 FP32 product to the accumulator (which is also an fp32 4×4 matrix).

Because the input matrices are fp16 while the multiplication results and accumulator are fp32, the algorithm is known as mixed precision.
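The operation described above can be sketched in pure Python. This is a software emulation for illustration only: the function name `mma_4x4` and the rounding helpers are invented here, and rounding the running sum to FP32 after every addition is an assumption, since the exact order and rounding inside the hardware are not covered in this article.

```python
import struct


def to_fp16(v):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", v))[0]


def to_fp32(v):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as nested lists.

    A and B are rounded to FP16 on input; C, the running sums and D are FP32.
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[to_fp32(v) for v in row] for row in c]
    for i in range(n):
        for j in range(n):
            acc = d[i][j]                       # start from the accumulator entry
            for k in range(n):
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]

d = mma_4x4(identity, b, c)
for row in d:
    print(row)
```

With the identity matrix as A, the result is simply B + C, which makes the output easy to check by hand. Larger matrix products are built by tiling them into many such 4×4 operations.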

The correct term would likely be just “4×4 matrix cores,” but the NVIDIA marketing team chose to use “tensor cores.”

For reference, here are the CUDA core counts and VRAM of several GeForce GTX cards, none of which include Tensor Cores.

GPU card | CUDA cores | VRAM |
---|---|---|
GeForce GTX 1660 Ti | 1536 | 6GB |
GeForce GTX 1660 Super | 1408 | 6GB |
GeForce GTX 1660 | 1408 | 6GB |
GeForce GTX 1650 Super | 1280 | 4GB |
GeForce GTX 1650 | 1024 and 896 | 4GB |
GeForce GTX 1060 6GB | 1280 | 6GB |
GeForce GTX 1060 3GB | 1152 | 3GB |
GeForce GTX 1050 Ti | 768 | 4GB |
GeForce GTX 1050 (2GB) | 640 | 2GB |
GeForce GTX 960 | 1024 | 2GB |
GeForce GTX 950 | 768 | 2GB |
GeForce GTX 780 Ti | 2880 | 3GB |
GeForce GTX 780 | 2304 | 3GB |
GeForce GTX 750 Ti | 640 | 2GB |
GeForce GTX 750 | 512 | 1GB or 2GB |

## Conclusion

- CUDA cores and Tensor Cores were both developed by Nvidia. CUDA stands for Compute Unified Device Architecture. According to Nvidia, CUDA cores are present in GPUs, smartphones, and even cars.
- Tensor Cores are specialized GPU cores that enable mixed-precision training. The first generation made it possible to train with mixed precision and the FP16 number format.
- This can give up to a 12x boost in teraFLOP throughput on certain GPUs. The second generation added INT8, INT4, and INT1 to the list of supported Tensor Core precisions.
- With mixed-precision training, GPU throughput increased by up to 32 times compared with Pascal GPUs. The fourth generation of Tensor Cores, based on the Hopper microarchitecture, adds support for FP8.